@bmr-cymru
Contributor

@bmr-cymru bmr-cymru commented Dec 20, 2025

Implement a best-effort file type guesser when include_file_type is disabled and add "hard exclude" paths that are always skipped even if include_system_dirs=True (e.g. /proc/kmsg). Also extend the MIME-type based categorization to cover more types and add new FileTypeCategory enum variants.

Resolves: #801
Resolves: #802
Resolves: #803
Resolves: #804
Resolves: #805
Resolves: #806
Resolves: #807
Resolves: #808
Resolves: #809
Resolves: #819

Summary by CodeRabbit

  • New Features

    • File‑type reporting now includes human‑readable descriptions (e.g. "filesystem directory", "symbolic link").
  • Improvements

    • File‑type detection gains a non‑magic fallback for environments without the magic library.
    • Diff outputs use clearer field names; per‑record lines show file type and description.
    • Scans apply an expanded, always‑excluded set of system/device patterns.
    • Entry formatting shows stat info on a separate line.
    • Unknown files now default to a binary‑style content diff.
  • Chores

    • Public CLI/config option renamed to use_magic_file_type.
  • Tests

    • Tests updated for renamed fields and explicit use_magic detection flag.

@bmr-cymru bmr-cymru self-assigned this Dec 20, 2025
@bmr-cymru bmr-cymru added bug Something isn't working enhancement New feature or request UI/UX User interface and experience DifferenceEngine labels Dec 20, 2025
@coderabbitai

coderabbitai bot commented Dec 20, 2025

Walkthrough

Rename include_file_type → use_magic_file_type across CLI and options; add best‑effort non‑magic file‑type guessing and many text/binary patterns; always‑exclude specific filesystem paths during tree walking; add human‑readable file_type_desc to FsDiffRecord and adjust short/full output fields and tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Option renaming: scripts/difftest.py, snapm/command.py, snapm/fsdiff/options.py | CLI -f/--file-types dest and DiffOptions field renamed from include_file_type to use_magic_file_type. Flag behaviour unchanged; callers now read use_magic_file_type. |
| File type detection expansion: snapm/fsdiff/filetypes.py | Added extensive TEXT/BINARY extension & filename maps, path hints and SYSTEMD_UNIT_EXTENSIONS; new guessing utilities (_generic_guess_file, _guess_text_file, _guess_binary_file, _guess_file); FileTypeCategory adds SOURCE_CODE, CERTIFICATE, SYMLINK; FileTypeDetector.detect_file_type(self, file_path, use_magic=False) and _guess_file_type() implemented. |
| Tree walk / exclusions / FsEntry formatting: snapm/fsdiff/treewalk.py | Introduced _ALWAYS_EXCLUDE_PATTERNS merged with user excludes; treewalk always computes file type for files and directories using detect_file_type(Path(...), use_magic=options.use_magic_file_type); FsEntry.__str__() renders stat block on a new line. |
| Output / reporting: snapm/fsdiff/engine.py | Added file_type_desc attribute and _get_file_type_desc() to FsDiffRecord; included file_type_desc in to_dict() and __str__(); FsDiffResults.short() updated to use content_diff_summary, changes, and include per‑record diff_type, file_type, file_type_desc. |
| Content diff fallback: snapm/fsdiff/contentdiff.py | Changed fallback for unknown/None file_type_info from TextContentDiffer to BinaryContentDiffer. |
| Tests updated: tests/fsdiff/test_engine.py, tests/fsdiff/test_filetypes.py | Tests now call detect_file_type(..., use_magic=True) where relevant and assert renamed output fields (diff_type, content_diff_summary, changes, file_type_desc). |
| CLI helper: scripts/difftest.py | -f/--file-types now sets dest='use_magic_file_type' (was include_file_type). |

Sequence Diagram(s)

mermaid
sequenceDiagram
autonumber
actor User
participant CLI as CLI (-f/--file-types)
participant Command as snapm.command
participant Treewalk as snapm.fsdiff.treewalk
participant Detector as snapm.fsdiff.filetypes
participant Engine as snapm.fsdiff.engine
User->>CLI: runs diff with -f flag
CLI->>Command: parse args (use_magic_file_type)
Command->>Treewalk: start walk(options.use_magic_file_type)
Treewalk->>Detector: detect_file_type(path, use_magic=options.use_magic_file_type)
Detector-->>Treewalk: FileTypeInfo (guessed or magic)
Treewalk->>Engine: emit FsEntry with file_type_info
Engine->>Engine: compute file_type_desc via _get_file_type_desc()
Engine-->>User: formatted result (short/full with file_type_desc)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay attention to snapm/fsdiff/filetypes.py rule ordering and mapping completeness.
  • Verify CLI→Command→Options→Treewalk→Detector propagation of use_magic_file_type.
  • Review _ALWAYS_EXCLUDE_PATTERNS for accidental over‑exclusion and platform‑specific paths.
  • Check new file_type_desc handling for None/edge cases and updated tests.

Possibly related issues

  • #801 — Implements best‑effort file type detection and always‑exclude paths; this PR adds non‑magic guessing and _ALWAYS_EXCLUDE_PATTERNS, addressing the objective.
  • #802 — Adds patterns for file type identification; expanded TEXT/BINARY maps and path hints in filetypes.py match this issue.
  • #803 — Introduces use_magic: bool and _guess_file_type() in FileTypeDetector; the PR implements both.
  • #804 — Add hard path exclusions to treewalk; _ALWAYS_EXCLUDE_PATTERNS implements this.
  • #805 — Add new FileTypeCategory variants; SOURCE_CODE, CERTIFICATE, SYMLINK were added.
  • #806 — Expand category_rules MIME mapping; the PR extends categorization rules and mappings.
  • #807 — Fix FsEntry.__str__() formatting; the stat block line change addresses this.
  • #808 — Make formatting consistent and add file_type_desc; FsDiffRecord now includes file_type_desc and short/full outputs updated.
  • #809 — Rename include_file_type → use_magic_file_type; CLI and option renames match this issue.

Possibly related PRs

"I hopped through maps of text and bytes,
I sniffed the paths both day and night,
I skip the noisy, I name each type,
I line the stats and keep excludes tight,
A tidy diff — I thump with delight!"

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes: best-effort file type detection and always-exclude path implementation. |
| Linked Issues check | ✅ Passed | All linked issues are addressed: file type guesser (#801), patterns added (#802), use_magic parameter and _guess_file_type() method (#803), hard exclusions in treewalk (#804), new FileTypeCategory variants (#805), expanded category_rules (#806), FsEntry stat formatting (#807), file_type_desc field and consistent formatting (#808), and include_file_type renamed to use_magic_file_type (#809). |
| Out of Scope Changes check | ✅ Passed | All changes are directly aligned with linked issues: CLI argument renaming, file type detection logic, exclusion patterns, field additions, and output formatting updates. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 82.14% which is sufficient. The required threshold is 80.00%. |

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4ee2bf and 1887199.

📒 Files selected for processing (1)
  • snapm/fsdiff/contentdiff.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (37)
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: virt_tests (bios, lvm-thin, fedora41)
  • GitHub Check: virt_tests (uefi, lvm, centos-stream10)
  • GitHub Check: virt_tests (bios, lvm-thin, centos-stream10)
  • GitHub Check: virt_tests (uefi, lvm-thin, centos-stream10)
  • GitHub Check: virt_tests (bios, lvm-thin, centos-stream9)
  • GitHub Check: virt_tests (bios, lvm, centos-stream9)
  • GitHub Check: virt_tests (uefi, lvm-thin, fedora42)
  • GitHub Check: virt_tests (uefi, lvm-thin, centos-stream9)
  • GitHub Check: virt_tests (bios, lvm-thin, fedora42)
  • GitHub Check: virt_tests (uefi, lvm, fedora42)
  • GitHub Check: virt_tests (uefi, lvm, fedora41)
  • GitHub Check: virt_tests (bios, lvm, centos-stream10)
  • GitHub Check: virt_tests (uefi, lvm, centos-stream9)
  • GitHub Check: python-basic-tests
🔇 Additional comments (1)
snapm/fsdiff/contentdiff.py (1)

615-615: The fallback behaviour change for unknown file types is safe and effectively unreachable in practice.

This change switches the fallback differ from TextContentDiffer to BinaryContentDiffer when file_type_info is None. However, examination of the code reveals that detect_file_type() always returns a FileTypeInfo object and never returns None. The _guess_file() function also always returns a 3-tuple—it has a fallback case that returns ("application/octet-stream", "unknown file type", "binary") rather than None.

The best-effort guesser comprehensively covers common text patterns: TEXT_EXTENSION_MAP includes over 100 extensions (.txt, .log, .conf, .yaml, .json, .csv, .html, .js, .service, etc.) and TEXT_FILENAME_MAP covers extensionless files (e.g. *readme, *makefile, *license, *fstab). Since file_type_info will never be None in normal operation, the condition triggering this fallback is unreachable, and there is no UX impact.
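The lookup order described above can be illustrated with a minimal sketch. It is not the code from snapm/fsdiff/filetypes.py: the maps are abbreviated to a few entries, the function name guess_file and the "utf-8" encoding value are assumptions made for illustration, and only the final fallback tuple is taken from the review text.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Hypothetical, heavily abbreviated stand-ins for the real maps.
TEXT_EXTENSION_MAP = {
    ".txt": ("text/plain", "plain text"),
    ".log": ("text/plain", "log file"),
    ".json": ("application/json", "JSON data"),
}
# Glob patterns matched against the lower-cased file name.
TEXT_FILENAME_MAP = {
    "*readme": ("text/plain", "readme file"),
    "*makefile": ("text/x-makefile", "makefile"),
}


def guess_file(path: Path):
    """Return a (mime_type, description, encoding) 3-tuple, never None."""
    name = path.name.lower()
    ext = path.suffix.lower()
    # 1. Extension lookup.
    if ext in TEXT_EXTENSION_MAP:
        mime, desc = TEXT_EXTENSION_MAP[ext]
        return (mime, desc, "utf-8")  # encoding value is an assumption
    # 2. Filename patterns for extensionless files.
    for pattern, (mime, desc) in TEXT_FILENAME_MAP.items():
        if fnmatchcase(name, pattern):
            return (mime, desc, "utf-8")
    # 3. Fallback quoted in the review: always a tuple, so callers
    #    never have to handle None.
    return ("application/octet-stream", "unknown file type", "binary")


for candidate in ("/var/log/boot.log", "/src/Makefile", "/usr/bin/ls"):
    print(candidate, guess_file(Path(candidate)))
```

Because the last branch always returns a tuple, a caller that wraps the result in a FileTypeInfo cannot end up with None, which is the property the comment above relies on.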


@packit-as-a-service

Congratulations! One of the builds has completed. 🍾

You can install the built RPMs by following these steps:

  • sudo dnf install -y 'dnf*-command(copr)'
  • dnf copr enable packit/snapshotmanager-snapm-810
  • And now you can install the packages.

Please note that the RPMs should be used only in a testing environment.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/fsdiff/test_filetypes.py (2)

70-78: Test no longer exercises the magic error handling path.

With the signature change to detect_file_type(file_path, use_magic=False), this test now defaults to the guessing path rather than the magic path. The mock on magic.detect_from_filename won't be triggered, and the test passes for the wrong reason.

🔎 Proposed fix
     @unittest.skipIf(not hasattr(magic, "error"), "magic does not have magic.error")
     @patch("snapm.fsdiff.filetypes.magic.detect_from_filename")
     def test_detect_file_type_error(self, mock_magic):
         # Simulate magic library error
         mock_magic.side_effect = magic.error("magic failed")
 
         path = Path("/broken")
-        info = self.detector.detect_file_type(path)
+        info = self.detector.detect_file_type(path, use_magic=True)
 
         self.assertEqual(info.category, FileTypeCategory.BINARY)
         self.assertEqual(info.mime_type, "application/octet-stream")

24-36: Other tests using mocked magic also need use_magic=True.

The tests test_categorize_log_file and test_categorize_json_config mock magic.detect_from_filename but call detect_file_type without use_magic=True. These tests will use the guessing path instead, making the mocks ineffective.

🔎 Proposed fix
     @patch("snapm.fsdiff.filetypes.magic.detect_from_filename")
     def test_categorize_log_file(self, mock_magic):
         # magic might say plain text, but filename says .log
         mock_res = MagicMock()
         mock_res.mime_type = "text/plain"
         mock_res.name = "ASCII text"
         mock_res.encoding = "us-ascii"
         mock_magic.return_value = mock_res
 
         path = Path("/var/log/syslog")
-        info = self.detector.detect_file_type(path)
+        info = self.detector.detect_file_type(path, use_magic=True)
 
         self.assertEqual(info.category, FileTypeCategory.LOG)
 
     @patch("snapm.fsdiff.filetypes.magic.detect_from_filename")
     def test_categorize_json_config(self, mock_magic):
         mock_res = MagicMock()
         mock_res.mime_type = "application/json"
         mock_res.name = "JSON data"
         mock_res.encoding = "us-ascii"
         mock_magic.return_value = mock_res
 
         path = Path("/etc/app/config.json")
-        info = self.detector.detect_file_type(path)
+        info = self.detector.detect_file_type(path, use_magic=True)
 
         self.assertEqual(info.category, FileTypeCategory.CONFIG)
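Why these tests can pass "for the wrong reason" is easiest to see in isolation. The following is a toy sketch, not the snapm detector: detect_file_type here is a hypothetical stand-in that takes the magic callable as a parameter (magic_detect) and uses ValueError in place of magic.error, so the example runs without libmagic installed.

```python
from unittest.mock import MagicMock


def detect_file_type(path, use_magic=False, magic_detect=None):
    """Toy stand-in: consult magic_detect only when use_magic is set."""
    if use_magic:
        try:
            return magic_detect(path).mime_type
        except ValueError:  # stands in for magic.error
            return "application/octet-stream"
    # Guessing path, reduced to its fallback for brevity.
    return "application/octet-stream"


mock_magic = MagicMock(side_effect=ValueError("magic failed"))

# Default call: the expected value comes back, but the mock was never
# touched, so the error-handling branch is not exercised at all.
assert detect_file_type("/broken", magic_detect=mock_magic) == "application/octet-stream"
mock_magic.assert_not_called()

# Explicit use_magic=True: same result, now via the except branch.
assert (
    detect_file_type("/broken", use_magic=True, magic_detect=mock_magic)
    == "application/octet-stream"
)
mock_magic.assert_called_once_with("/broken")
```

Adding an assertion on the mock (for example assert_called_once) alongside the value assertions would make a test fail loudly if a later default change routes it away from the mocked path again.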
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3feeadd and 516e7b5.

📒 Files selected for processing (8)
  • scripts/difftest.py (1 hunks)
  • snapm/command.py (1 hunks)
  • snapm/fsdiff/engine.py (5 hunks)
  • snapm/fsdiff/filetypes.py (5 hunks)
  • snapm/fsdiff/options.py (1 hunks)
  • snapm/fsdiff/treewalk.py (5 hunks)
  • tests/fsdiff/test_engine.py (1 hunks)
  • tests/fsdiff/test_filetypes.py (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
📚 Learning: 2025-12-02T16:08:02.588Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-02T16:08:02.588Z
Learning: In the snapshotmanager/snapm codebase, `FileTypeInfo` in `snapm/_fsdiff/filetypes.py` is not suitable for conversion to `dataclass` because it requires custom `__init__` logic to compute `is_text_like` from the `category` parameter, and has a custom `__str__` method that formats field names as human-readable strings (e.g., "MIME type:"). The project uses `dataclass` extensively for classes like `SnapmConfig` and `GcPolicyParams*`, but only where it doesn't interfere with custom initialization or formatting requirements.

Applied to files:

  • snapm/fsdiff/options.py
  • snapm/fsdiff/filetypes.py
  • snapm/fsdiff/engine.py
  • tests/fsdiff/test_filetypes.py
📚 Learning: 2025-12-14T12:52:14.459Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 707
File: snapm/fsdiff/engine.py:327-405
Timestamp: 2025-12-14T12:52:14.459Z
Learning: In snapm/fsdiff/engine.py, render_unified_diff intentionally uses tc.WHITE instead of tc.NORMAL to reset colors after diff lines. This avoids breaking less -R output when piping through files, and is intended to support --color=always with less -R. When reviewing changes in this file, ensure any color reset logic preserves compatibility with downstream pagers/filters; if you introduce color resets, test with --color=always and piping to less -R. If you modify reset color behavior, consider updating tests and documenting rationale.

Applied to files:

  • snapm/fsdiff/engine.py
📚 Learning: 2025-09-24T02:49:21.752Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 490
File: snapm/manager/plugins/lvm2.py:618-629
Timestamp: 2025-09-24T02:49:21.752Z
Learning: When API changes affect return types (like changing from int to tuple), all supporting infrastructure needs updating including mock scripts in tests, not just the call sites that consume the API.

Applied to files:

  • tests/fsdiff/test_filetypes.py
📚 Learning: 2025-12-10T18:28:51.721Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-10T18:28:51.721Z
Learning: In the snapshotmanager/snapm codebase, `TreeWalker.walk_tree()` in `snapm/_fsdiff/treewalk.py` eagerly materializes all paths before processing (using `os.walk`) to compute the total count for progress reporting. This design is intentional and necessary because accurate progress updates require knowing the total upfront. The custom `snapm._progress` implementation is lightweight (~2000 lines with tests) and RHEL-compatible, unlike alternatives like `tqdm` (>8000 lines, not in RHEL). The eager materialization trade-off is well-justified for the UX benefit of progress reporting on large trees.

Applied to files:

  • snapm/fsdiff/treewalk.py
🧬 Code graph analysis (3)
snapm/fsdiff/engine.py (1)
snapm/fsdiff/treewalk.py (2)
  • is_dir (310-318)
  • is_symlink (321-329)
snapm/fsdiff/treewalk.py (1)
snapm/fsdiff/filetypes.py (1)
  • detect_file_type (738-771)
tests/fsdiff/test_engine.py (1)
snapm/fsdiff/engine.py (1)
  • short (650-679)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (34)
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: virt_tests (uefi, lvm-thin, fedora42)
  • GitHub Check: virt_tests (bios, lvm-thin, fedora42)
  • GitHub Check: virt_tests (uefi, lvm, fedora41)
  • GitHub Check: virt_tests (uefi, lvm, centos-stream10)
  • GitHub Check: virt_tests (uefi, lvm, centos-stream9)
  • GitHub Check: virt_tests (uefi, lvm-thin, fedora41)
  • GitHub Check: virt_tests (bios, lvm-thin, centos-stream9)
  • GitHub Check: virt_tests (uefi, lvm-thin, centos-stream9)
  • GitHub Check: virt_tests (bios, lvm, centos-stream10)
  • GitHub Check: virt_tests (uefi, lvm-thin, centos-stream10)
  • GitHub Check: virt_tests (bios, lvm, centos-stream9)
  • GitHub Check: virt_tests (bios, lvm-thin, fedora41)
  • GitHub Check: virt_tests (bios, lvm-thin, centos-stream10)
  • GitHub Check: virt_tests (uefi, lvm, fedora42)
  • GitHub Check: virt_tests (bios, lvm, fedora41)
  • GitHub Check: python-basic-tests
🔇 Additional comments (19)
snapm/fsdiff/filetypes.py (4)

32-242: LGTM! Comprehensive text extension mapping.

The extensive mapping covers a wide range of text-based file types including documentation, configuration, web standards, scripting, source code, and more. The consistent format with MIME types and lowercase descriptions is well-structured.


579-581: LGTM! New FileTypeCategory enum variants.

The new SOURCE_CODE, CERTIFICATE, and SYMLINK categories align with the PR objectives and provide more granular file type categorisation.


738-771: LGTM! Well-structured dual-path detection.

The detect_file_type method now cleanly handles both magic-based and guessing-based detection paths. Error handling for the magic path is robust with platform-specific exception handling for c9s compatibility.


801-815: LGTM! Clean delegation to guessing helper.

The _guess_file_type method properly wraps the module-level _guess_file function and integrates with the existing categorisation logic.

scripts/difftest.py (1)

28-34: LGTM! Consistent rename of argument destination.

The dest parameter rename from include_file_type to use_magic_file_type aligns with the corresponding change in DiffOptions and ensures from_cmd_args correctly maps the CLI flag to the dataclass field.
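As a rough illustration of how the dest rename keeps the user-facing flag unchanged, the sketch below wires an argparse option to a one-field dataclass. DiffOptions and from_cmd_args are named in the review, but their bodies here, the store_true action, the help text and the build_parser helper are assumptions, not the project's code.

```python
import argparse
from dataclasses import dataclass


@dataclass
class DiffOptions:
    # False (the default) selects the best-effort guessing path.
    use_magic_file_type: bool = False

    @classmethod
    def from_cmd_args(cls, args: argparse.Namespace) -> "DiffOptions":
        # The attribute name on the namespace is whatever dest= says.
        return cls(use_magic_file_type=args.use_magic_file_type)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="difftest")
    parser.add_argument(
        "-f",
        "--file-types",
        dest="use_magic_file_type",  # was include_file_type
        action="store_true",         # assumed action for this sketch
        help="Use magic-based file type detection",
    )
    return parser


opts = DiffOptions.from_cmd_args(build_parser().parse_args(["-f"]))
print(opts)
```

Users still type -f or --file-types; only code that reads the parsed namespace or the options object sees the new name.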

tests/fsdiff/test_engine.py (2)

73-76: LGTM! Test assertions updated to match new output format.

The assertions now correctly expect the updated output field names (diff_type: and content_diff_summary:) as reflected in the FsDiffResults.short() method shown in the relevant code snippet.


245-253: LGTM! FsDiffRecord string representation test updated.

The test correctly reflects the updated __str__ output format with diff_type: modified and metadata_changed: fields.

snapm/fsdiff/options.py (1)

43-44: LGTM! Option renamed to clarify magic-based detection.

The rename from include_file_type to use_magic_file_type better communicates the behaviour - when True, the libmagic library is used for detection; when False (default), the best-effort guessing path is used. This aligns with PR objective #809.

snapm/command.py (1)

2718-2718: LGTM! Clean rename aligning with updated option naming.

The destination variable rename from include_file_type to use_magic_file_type correctly reflects the updated naming convention whilst preserving the user-facing -f/--file-types option unchanged.

snapm/fsdiff/treewalk.py (5)

55-94: Excellent comprehensive hazard mitigation.

The _ALWAYS_EXCLUDE_PATTERNS constant provides thorough coverage of dangerous system paths with clear categorisation and inline documentation. The patterns correctly address known hazards including:

  • Blocking streams (/proc/kmsg, /dev/console)
  • Infinite data sources (/dev/zero, /dev/random)
  • System memory access (/proc/kcore, /dev/mem)
  • Hardware triggers (/dev/watchdog*)

The glob patterns are appropriate for use with fnmatch.


243-243: LGTM! Improved output formatting.

The formatting change to print the stat block on a new line enhances readability of the FsEntry string representation.


463-465: LGTM! Consistent file type detection for directories.

File type detection is now always invoked for directories with the configurable use_magic flag, enabling best-effort type guessing when libmagic is unavailable. This aligns with PR objectives #801 and #803.


482-484: LGTM! Consistent file type detection for regular files.

File type detection is now always invoked for regular files with the configurable use_magic flag, mirroring the directory handling and ensuring consistent type information throughout the tree walk.


581-586: LGTM! Correct implementation of always-exclude logic.

The combination of _ALWAYS_EXCLUDE_PATTERNS with user-provided exclusion patterns correctly implements the "hard exclude" behaviour that cannot be bypassed by --include-system-dirs. The patterns are properly applied using fnmatch to filter paths during tree traversal.
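A minimal sketch of the "hard exclude" idea follows, assuming a simple list-merge design. The four patterns are examples quoted in this review; SYSTEM_DIR_PATTERNS, effective_excludes and is_excluded are hypothetical names introduced for illustration, and the real treewalk code may structure this differently.

```python
from fnmatch import fnmatchcase

# Abbreviated: a few of the hazards named in the review.
_ALWAYS_EXCLUDE_PATTERNS = (
    "/proc/kmsg",      # blocking stream
    "/proc/kcore",     # system memory access
    "/dev/zero",       # infinite data source
    "/dev/watchdog*",  # hardware trigger
)

# Hypothetical soft excludes that include_system_dirs can lift.
SYSTEM_DIR_PATTERNS = ("/proc/*", "/dev/*", "/sys/*")


def effective_excludes(user_excludes, include_system_dirs):
    """Hard excludes are always present; system dirs only by default."""
    patterns = list(_ALWAYS_EXCLUDE_PATTERNS) + list(user_excludes)
    if not include_system_dirs:
        patterns.extend(SYSTEM_DIR_PATTERNS)
    return patterns


def is_excluded(path, patterns):
    """True if path matches any glob pattern (case-sensitive)."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


patterns = effective_excludes([], include_system_dirs=True)
for path in ("/proc/kmsg", "/dev/watchdog0", "/proc/cpuinfo"):
    print(path, is_excluded(path, patterns))
```

Note that fnmatch-style `*` also matches `/`, so a pattern such as `/dev/watchdog*` would cover nested paths as well; that is worth keeping in mind when checking the real list for over-exclusion.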

snapm/fsdiff/engine.py (5)

86-86: LGTM! Consistent field initialization pattern.

The file_type_desc field initialization follows the established pattern of using a helper method, maintaining consistency with file_type and file_category initialization.


261-278: LGTM! Well-structured file type description logic.

The _get_file_type_desc() method provides clear human-readable descriptions with appropriate fallback behaviour:

  • Special handling for directories and symbolic links
  • Leverages file_type_info.description when available
  • Graceful fallback to "unknown" for edge cases

The descriptions enhance user-facing output quality as intended by PR objective #808.
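The fallback chain listed above might look roughly like the sketch below. It is an approximation under stated assumptions: get_file_type_desc is a free function rather than the FsDiffRecord method, entries are modelled with SimpleNamespace, and the order of the symlink and directory checks is a guess. The strings "filesystem directory" and "symbolic link" come from the PR summary.

```python
from types import SimpleNamespace


def get_file_type_desc(entry):
    """Return a human-readable type description for an entry."""
    # Special cases first (ordering is an assumption of this sketch).
    if entry.is_symlink:
        return "symbolic link"
    if entry.is_dir:
        return "filesystem directory"
    # Prefer the detector's description when one is available.
    info = getattr(entry, "file_type_info", None)
    if info is not None and info.description:
        return info.description
    # Edge case: no type information at all.
    return "unknown"


regular = SimpleNamespace(
    is_symlink=False,
    is_dir=False,
    file_type_info=SimpleNamespace(description="JSON data"),
)
bare = SimpleNamespace(is_symlink=False, is_dir=False, file_type_info=None)
print(get_file_type_desc(regular), "/", get_file_type_desc(bare))
```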


168-168: LGTM! Logical field placement in string representation.

The file_type_desc field is appropriately positioned after file_type in the string representation, maintaining logical grouping of related information.


199-199: LGTM! Complete API surface coverage.

The file_type_desc field is correctly included in the dictionary representation, ensuring the field is available in JSON output and maintaining consistency with the string representation.


661-677: LGTM! Improved output consistency and completeness.

The updates to the short() method enhance output quality:

  • Field labels now match actual attribute names (content_diff_summary, changes)
  • Includes the new file_type_desc field for richer type information
  • Maintains clean, readable formatting

These changes align with PR objective #808 for consistent full/short formatting.

@bmr-cymru bmr-cymru force-pushed the bmr-extended-file-types branch from 516e7b5 to 2aa4499 Compare December 20, 2025 20:13
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
snapm/fsdiff/filetypes.py (3)

475-508: Typo in docstring.

Line 492 contains a typo: "desciption" should be "description".

🔎 Proposed fix
-    :returns: A 3-tuple containing (mime_type, desciption, encoding) if the
+    :returns: A 3-tuple containing (mime_type, description, encoding) if the
               type could be guessed or ``None`` otherwise.

556-574: Typo in docstring.

Line 563 contains a typo: "desciption" should be "description".

🔎 Proposed fix
-    :returns: A 3-tuple containing (mime_type, desciption, encoding).
+    :returns: A 3-tuple containing (mime_type, description, encoding).

511-524: Typo in docstring.

Line 518 contains a typo: "desciption" should be "description".

🔎 Proposed fix
-    :returns: A 3-tuple containing (mime_type, desciption, encoding) if the
+    :returns: A 3-tuple containing (mime_type, description, encoding) if the
               type could be guessed or ``None`` otherwise.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 516e7b5 and 2aa4499.

📒 Files selected for processing (8)
  • scripts/difftest.py (1 hunks)
  • snapm/command.py (1 hunks)
  • snapm/fsdiff/engine.py (5 hunks)
  • snapm/fsdiff/filetypes.py (5 hunks)
  • snapm/fsdiff/options.py (1 hunks)
  • snapm/fsdiff/treewalk.py (5 hunks)
  • tests/fsdiff/test_engine.py (1 hunks)
  • tests/fsdiff/test_filetypes.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • snapm/command.py
  • scripts/difftest.py
  • tests/fsdiff/test_engine.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
📚 Learning: 2025-12-10T18:28:51.721Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-10T18:28:51.721Z
Learning: In the snapshotmanager/snapm codebase, `TreeWalker.walk_tree()` in `snapm/_fsdiff/treewalk.py` eagerly materializes all paths before processing (using `os.walk`) to compute the total count for progress reporting. This design is intentional and necessary because accurate progress updates require knowing the total upfront. The custom `snapm._progress` implementation is lightweight (~2000 lines with tests) and RHEL-compatible, unlike alternatives like `tqdm` (>8000 lines, not in RHEL). The eager materialization trade-off is well-justified for the UX benefit of progress reporting on large trees.

Applied to files:

  • snapm/fsdiff/treewalk.py
📚 Learning: 2025-12-02T16:08:02.588Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-02T16:08:02.588Z
Learning: In the snapshotmanager/snapm codebase, `FileTypeInfo` in `snapm/_fsdiff/filetypes.py` is not suitable for conversion to `dataclass` because it requires custom `__init__` logic to compute `is_text_like` from the `category` parameter, and has a custom `__str__` method that formats field names as human-readable strings (e.g., "MIME type:"). The project uses `dataclass` extensively for classes like `SnapmConfig` and `GcPolicyParams*`, but only where it doesn't interfere with custom initialization or formatting requirements.

Applied to files:

  • snapm/fsdiff/options.py
  • snapm/fsdiff/filetypes.py
  • snapm/fsdiff/engine.py
📚 Learning: 2025-12-14T12:52:14.459Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 707
File: snapm/fsdiff/engine.py:327-405
Timestamp: 2025-12-14T12:52:14.459Z
Learning: In snapm/fsdiff/engine.py, render_unified_diff intentionally uses tc.WHITE instead of tc.NORMAL to reset colors after diff lines. This avoids breaking less -R output when piping through files, and is intended to support --color=always with less -R. When reviewing changes in this file, ensure any color reset logic preserves compatibility with downstream pagers/filters; if you introduce color resets, test with --color=always and piping to less -R. If you modify reset color behavior, consider updating tests and documenting rationale.

Applied to files:

  • snapm/fsdiff/engine.py
🧬 Code graph analysis (3)
snapm/fsdiff/treewalk.py (1)
snapm/fsdiff/filetypes.py (1)
  • detect_file_type (753-786)
snapm/fsdiff/engine.py (1)
snapm/fsdiff/treewalk.py (2)
  • is_dir (310-318)
  • is_symlink (321-329)
tests/fsdiff/test_filetypes.py (1)
snapm/fsdiff/filetypes.py (1)
  • detect_file_type (753-786)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (39)
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: virt_tests (bios, lvm-thin, fedora41)
  • GitHub Check: virt_tests (bios, lvm-thin, fedora42)
  • GitHub Check: virt_tests (bios, lvm-thin, centos-stream10)
  • GitHub Check: virt_tests (uefi, lvm, fedora42)
  • GitHub Check: virt_tests (bios, lvm, centos-stream10)
  • GitHub Check: virt_tests (uefi, lvm, fedora41)
  • GitHub Check: virt_tests (uefi, lvm-thin, fedora42)
  • GitHub Check: virt_tests (uefi, lvm-thin, centos-stream9)
  • GitHub Check: virt_tests (bios, lvm, fedora41)
  • GitHub Check: virt_tests (uefi, lvm, centos-stream9)
  • GitHub Check: virt_tests (bios, lvm, centos-stream9)
  • GitHub Check: virt_tests (bios, lvm, fedora42)
  • GitHub Check: virt_tests (uefi, lvm-thin, centos-stream10)
  • GitHub Check: virt_tests (bios, lvm-thin, centos-stream9)
  • GitHub Check: virt_tests (uefi, lvm, centos-stream10)
  • GitHub Check: python-basic-tests
🔇 Additional comments (14)
tests/fsdiff/test_filetypes.py (1)

34-34: Tests correctly updated for new API.

The test call sites properly pass use_magic=True to exercise the magic-based file type detection path, aligning with the new parameter introduced in detect_file_type.

Also applies to: 47-47, 75-75

snapm/fsdiff/options.py (1)

44-44: Field rename correctly implemented.

The rename from include_file_type to use_magic_file_type clearly reflects the distinction between magic-based and non-magic file type detection, improving API clarity.

snapm/fsdiff/treewalk.py (3)

55-94: Excellent hard exclusion patterns.

The _ALWAYS_EXCLUDE_PATTERNS comprehensively covers hazardous filesystem paths that should never be traversed, including blocking streams, watchdogs, and debug interfaces. The categorization and inline comments are helpful.


463-465: File type detection consistently updated.

Both _process_dir and _process_file now unconditionally compute file type information, passing the use_magic parameter from options. This ensures consistent behaviour across the codebase.

Also applies to: 482-484


581-586: Exclusion logic correctly implements hard exclusions.

The combination of _ALWAYS_EXCLUDE_PATTERNS with self.exclude_patterns ensures that dangerous paths are always excluded, regardless of user configuration. The fnmatch check correctly applies both sets of patterns.
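
The exclusion check described above can be sketched roughly as follows. This is a minimal illustration, not the code in treewalk.py: the pattern tuple is a hypothetical three-entry excerpt (only /proc/kmsg is named in the PR description), and is_excluded is an invented helper name.

```python
from fnmatch import fnmatch

# Hypothetical excerpt; the real _ALWAYS_EXCLUDE_PATTERNS list in
# snapm/fsdiff/treewalk.py is longer and is not reproduced here.
ALWAYS_EXCLUDE = ("/proc/kmsg", "/dev/watchdog*", "/sys/kernel/debug/*")


def is_excluded(pathname, user_patterns=()):
    """Return True if pathname matches a hard or a user-supplied pattern.

    Hard excludes are always part of the pattern set, so no caller
    option can switch them off.
    """
    patterns = ALWAYS_EXCLUDE + tuple(user_patterns)
    return any(fnmatch(pathname, pattern) for pattern in patterns)
```

Because the hard patterns are concatenated unconditionally, a caller passing an empty user pattern list still gets /proc/kmsg skipped.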

snapm/fsdiff/filetypes.py (6)

32-472: Comprehensive file type mappings added.

The extensive text and binary file type mappings provide excellent coverage for best-effort file type detection without libmagic. The organization by category (general text, web standards, source code, archives, media, etc.) makes the mappings maintainable.


594-596: New enum variants correctly added.

The SOURCE_CODE, CERTIFICATE, and SYMLINK category additions provide better file type granularity as specified in issue #805.


658-751: Excellent expansion of MIME type categorization.

The expanded category_rules provide comprehensive coverage of MIME types across archives, executables, documents, configurations, databases, source code, and certificates. This significantly improves categorization accuracy.


753-786: Dual detection paths correctly implemented.

The use_magic parameter enables both magic-based and best-effort non-magic file type detection. The magic path preserves existing error handling, while the non-magic path provides a fallback. The default of False encourages non-magic detection as intended.


809-811: Systemd unit file detection correctly implemented.

The special handling for systemd unit files outside /etc ensures they are correctly categorized as CONFIG based on their extension.


820-834: Non-magic detection method correctly implemented.

The _guess_file_type method properly integrates the non-magic guessing helpers with the existing categorization logic, returning a complete FileTypeInfo structure.

snapm/fsdiff/engine.py (3)

86-86: File type description field correctly implemented.

The file_type_desc field and _get_file_type_desc() helper provide human-readable file type descriptions, with appropriate special cases for directories and symlinks, and sensible fallbacks. This enhances reporting clarity as specified in issue #808.

Also applies to: 261-277


168-168: File type description properly exposed in outputs.

The file_type_desc field is correctly included in both the string representation and dictionary serialization, maintaining consistency with other reporting fields.

Also applies to: 199-199


661-676: Short format consistently updated with new fields.

The updated field labels (content_diff_summary, changes) and inclusion of file_type_desc improve output clarity and consistency as specified in issue #808.

@bmr-cymru bmr-cymru force-pushed the bmr-extended-file-types branch from 2aa4499 to d4be023 Compare December 20, 2025 20:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
tests/fsdiff/test_filetypes.py (1)

80-83: Consider adding tests for the non-magic guessing path.

The tests cover the use_magic=True path, but there appear to be no tests for the new _guess_file_type() fallback path (when use_magic=False). Consider adding tests that verify extension-based and path-based guessing works correctly.

Would you like me to open an issue to track adding test coverage for the non-magic file type guessing functionality?

snapm/fsdiff/filetypes.py (2)

285-439: Note: .ts extension conflict with TypeScript.

The .ts extension at line 371 is mapped to "video/mp2t" (MPEG Transport Stream), but it's also commonly used for TypeScript source files (mapped in TEXT_EXTENSION_MAP at line 92). Since _guess_file checks binary extensions first, TypeScript files would be misidentified as video files.

This is an inherent ambiguity with the .ts extension. For a best-effort guesser, you might consider:

  1. Removing .ts from BINARY_EXTENSION_MAP (TypeScript is more common in development contexts)
  2. Adding path-based heuristics (e.g., files near package.json are likely TypeScript)
  3. Documenting this limitation
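
The lookup-order problem can be reproduced with a small standalone sketch. The two maps below are hypothetical one-entry excerpts (the TypeScript MIME string in particular is an assumption), and guess is an invented stand-in for the real helper, so treat this as an illustration of the ambiguity rather than the project's code.

```python
from pathlib import Path

# Hypothetical excerpts; the real maps in filetypes.py are far larger.
BINARY_EXTENSION_MAP = {".ts": ("video/mp2t", "MPEG transport stream")}
TEXT_EXTENSION_MAP = {".ts": ("text/x-typescript", "TypeScript source")}


def guess(path, binary_first=True):
    """Look up the file suffix in both maps, in the chosen order."""
    suffix = Path(path).suffix.lower()
    maps = [BINARY_EXTENSION_MAP, TEXT_EXTENSION_MAP]
    if not binary_first:
        maps.reverse()
    for ext_map in maps:
        if suffix in ext_map:
            return ext_map[suffix]
    # Neither map knows the suffix: fall back to generic binary data.
    return ("application/octet-stream", "unknown data")
```

With binary_first=True a file named main.ts resolves to video/mp2t; reversing the order yields the TypeScript entry instead, which is the trade-off the options above are weighing.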

820-834: Add return type annotation to function signature.

The docstring correctly specifies `:rtype: ``FileTypeInfo```, but the function signature is missing the return type annotation for consistency with the rest of the codebase.

🔎 Proposed fix
-    def _guess_file_type(self, file_path: Path):
+    def _guess_file_type(self, file_path: Path) -> FileTypeInfo:
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2aa4499 and d4be023.

📒 Files selected for processing (8)
  • scripts/difftest.py (1 hunks)
  • snapm/command.py (1 hunks)
  • snapm/fsdiff/engine.py (5 hunks)
  • snapm/fsdiff/filetypes.py (5 hunks)
  • snapm/fsdiff/options.py (1 hunks)
  • snapm/fsdiff/treewalk.py (5 hunks)
  • tests/fsdiff/test_engine.py (1 hunks)
  • tests/fsdiff/test_filetypes.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/fsdiff/test_engine.py
  • snapm/fsdiff/treewalk.py
  • snapm/command.py
  • scripts/difftest.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
📚 Learning: 2025-12-02T16:08:02.588Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-02T16:08:02.588Z
Learning: In the snapshotmanager/snapm codebase, `FileTypeInfo` in `snapm/_fsdiff/filetypes.py` is not suitable for conversion to `dataclass` because it requires custom `__init__` logic to compute `is_text_like` from the `category` parameter, and has a custom `__str__` method that formats field names as human-readable strings (e.g., "MIME type:"). The project uses `dataclass` extensively for classes like `SnapmConfig` and `GcPolicyParams*`, but only where it doesn't interfere with custom initialization or formatting requirements.

Applied to files:

  • snapm/fsdiff/options.py
  • snapm/fsdiff/filetypes.py
  • snapm/fsdiff/engine.py
📚 Learning: 2025-10-22T20:06:54.993Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.

Applied to files:

  • snapm/fsdiff/filetypes.py
📚 Learning: 2025-12-14T12:52:14.459Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 707
File: snapm/fsdiff/engine.py:327-405
Timestamp: 2025-12-14T12:52:14.459Z
Learning: In snapm/fsdiff/engine.py, render_unified_diff intentionally uses tc.WHITE instead of tc.NORMAL to reset colors after diff lines. This avoids breaking less -R output when piping through files, and is intended to support --color=always with less -R. When reviewing changes in this file, ensure any color reset logic preserves compatibility with downstream pagers/filters; if you introduce color resets, test with --color=always and piping to less -R. If you modify reset color behavior, consider updating tests and documenting rationale.

Applied to files:

  • snapm/fsdiff/engine.py
🧬 Code graph analysis (2)
tests/fsdiff/test_filetypes.py (1)
snapm/fsdiff/filetypes.py (1)
  • detect_file_type (753-786)
snapm/fsdiff/engine.py (1)
snapm/fsdiff/treewalk.py (2)
  • is_dir (310-318)
  • is_symlink (321-329)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: virt_tests (uefi, lvm-thin, fedora41)
  • GitHub Check: virt_tests (bios, lvm, centos-stream10)
  • GitHub Check: virt_tests (bios, lvm-thin, centos-stream9)
  • GitHub Check: virt_tests (uefi, lvm, centos-stream9)
  • GitHub Check: virt_tests (bios, lvm-thin, fedora42)
  • GitHub Check: virt_tests (uefi, lvm-thin, fedora42)
  • GitHub Check: python-basic-tests
🔇 Additional comments (17)
snapm/fsdiff/options.py (1)

43-44: LGTM!

The field rename from include_file_type to use_magic_file_type accurately reflects the new semantics where False triggers best-effort guessing and True uses libmagic. The comment clearly describes the behaviour.

tests/fsdiff/test_filetypes.py (3)

24-36: LGTM!

The test correctly passes use_magic=True to exercise the magic-based detection path, which aligns with the updated API where the default is now False. The mock setup and assertions remain valid.


38-49: LGTM!

Consistent update to use use_magic=True for the JSON config categorisation test.


67-78: LGTM!

Error handling test correctly updated to use use_magic=True since the error path only applies when libmagic is used.

snapm/fsdiff/engine.py (4)

86-86: LGTM!

The new file_type_desc field is initialised correctly alongside the existing file_type field, providing a human-readable description for reporting.


261-277: LGTM!

The _get_file_type_desc method follows the same pattern as _get_file_type and correctly returns descriptive strings for directories ("filesystem directory"), symlinks ("symbolic link"), and falls back to file_type_info.description when available.


186-199: LGTM!

The to_dict method correctly includes the new file_type_desc field, maintaining consistency with __str__ output.


650-679: LGTM!

The short() method output is now more consistent with full() by using matching field labels (content_diff_summary, changes, diff_type, file_type, file_type_desc). This addresses objective #808 for consistent full/short formatting.

snapm/fsdiff/filetypes.py (9)

32-242: LGTM!

The TEXT_EXTENSION_MAP provides comprehensive coverage for text-based file types including source code, configuration files, documentation, and web standards. The format is consistent with 2-tuples of (mime_type, description).


244-268: LGTM!

The TEXT_FILENAME_MAP handles common extensionless files correctly, including build files, release files, and documentation.


270-283: LGTM!

The SYSTEMD_UNIT_EXTENSIONS tuple provides a clear list of systemd unit file extensions for special handling in _categorize_file.


475-508: LGTM!

The _generic_guess_file function is well-designed with clear parameters for extension and filename maps. The docstring correctly documents the 3-tuple return type.


545-553: LGTM!

The logic correctly handles the edge case where text files (like shell scripts or config files) exist in binary directories. Converting abs_parent_path to string before the dictionary lookup addresses the type mismatch issue flagged in earlier reviews.
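
The type mismatch mentioned here is easy to reproduce in isolation. In the sketch below, BINARY_DIRS is a hypothetical string-keyed table (the real dictionary is not shown in this review), and PurePosixPath is used only so the example behaves the same on any platform.

```python
from pathlib import PurePosixPath

# Hypothetical string-keyed table of directories that usually hold binaries.
BINARY_DIRS = {"/usr/bin": "executable", "/usr/lib64": "shared library"}

abs_parent_path = PurePosixPath("/usr/bin/ls").parent

# A path object never compares equal to a str, so this lookup misses.
path_key_found = abs_parent_path in BINARY_DIRS

# Converting to str first makes the key type match the table.
str_key_found = str(abs_parent_path) in BINARY_DIRS
```

The first lookup silently returns False rather than raising, which is why this kind of bug tends to go unnoticed without a test.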


556-574: LGTM!

The _guess_file function provides a sensible fallback chain: binary guess → text guess → default octet-stream. This ensures a valid 3-tuple is always returned.


594-596: LGTM!

The new SOURCE_CODE, CERTIFICATE, and SYMLINK enum values align with objective #805 and are properly integrated into category_rules.


658-751: LGTM!

The expanded category_rules provides comprehensive MIME-type categorisation. The # fmt: off/on comments are appropriate for maintaining readability of the dictionary structure.


753-786: LGTM!

The detect_file_type method correctly implements the dual-path detection:

  • use_magic=True: Uses libmagic with proper error handling for platform variations (c9s magic.error)
  • use_magic=False: Falls back to _guess_file_type

The default use_magic=False aligns with the PR objective to provide best-effort detection without requiring libmagic.

@bmr-cymru bmr-cymru force-pushed the bmr-extended-file-types branch 2 times, most recently from 236d899 to 122d66e Compare December 20, 2025 21:11
@bmr-cymru
Contributor Author

@coderabbitai: full review

@coderabbitai

coderabbitai bot commented Dec 20, 2025

@bmr-cymru: I'll conduct a full review of all the changes in this PR.


🧠 Learnings used
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/fsdiff/test_filetypes.py (1)

24-37: Tests now explicitly exercise the magic-based detection path

The updated calls to detect_file_type(..., use_magic=True) keep these tests focused on the libmagic code path, and the error-handling test remains guarded correctly by hasattr(magic, "error"). You may want a small additional test for the non-magic guesser in future, but it is not required for this PR.

Also applies to: 38-50, 68-79

snapm/fsdiff/filetypes.py (1)

577-602: Source-code categorisation and text-likeness heuristics could be sharpened

The new SOURCE_CODE, CERTIFICATE, and SYMLINK categories and the expanded category_rules are a nice improvement, but two aspects slightly undermine their effect:

  • In category_rules, the generic "text/" prefix rule appears before the more specific text/x-* source-code entries. Because the rules are iterated in insertion order and matched with mime_type.startswith(pattern), any MIME type such as text/x-python or text/x-c will be classified as TEXT before the SOURCE_CODE rules are considered.
  • FileTypeInfo.is_text_like only treats TEXT, CONFIG, and LOG as text-like. Any MIME type now classified as SOURCE_CODE (e.g. application/javascript or similar) will be considered non-text, which may affect callers that rely on is_text_like (for example, to decide whether to generate a unified or a binary diff).

If you want SOURCE_CODE to behave as a distinct but still text-like category, you could:

  • Move the generic "text/" rule to the end of category_rules, after the more specific text/x-*, text/html, and text/css entries, or split the rules into "exact" and "prefix" sets and check exact matches first.
  • Include FileTypeCategory.SOURCE_CODE (and possibly DOCUMENT) in the is_text_like set.

This is behavioural polish rather than a blocker, so it can be deferred, but it will make the new categories more effective.

Also applies to: 658-751, 788-819, 820-834
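
The ordering effect described in this comment can be shown with a minimal sketch. The Category enum, the categorize function, and the two rule dictionaries are simplified stand-ins invented for illustration; only the first-match-wins startswith loop over an insertion-ordered dict is taken from the comment.

```python
from enum import Enum


class Category(Enum):
    TEXT = "text"
    SOURCE_CODE = "source_code"
    BINARY = "binary"


def categorize(mime_type, rules):
    """Return the category of the first rule whose prefix matches."""
    for pattern, category in rules.items():
        if mime_type.startswith(pattern):
            return category
    return Category.BINARY


# Generic prefix listed first: it shadows the specific entry.
generic_first = {"text/": Category.TEXT, "text/x-python": Category.SOURCE_CODE}

# Specific entry listed first: the generic prefix is only a fallback.
specific_first = {"text/x-python": Category.SOURCE_CODE, "text/": Category.TEXT}
```

With generic_first, text/x-python is categorized as TEXT; with specific_first it becomes SOURCE_CODE while text/plain still falls through to TEXT.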

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3feeadd and 122d66e.

📒 Files selected for processing (8)
  • scripts/difftest.py (1 hunks)
  • snapm/command.py (1 hunks)
  • snapm/fsdiff/engine.py (5 hunks)
  • snapm/fsdiff/filetypes.py (5 hunks)
  • snapm/fsdiff/options.py (1 hunks)
  • snapm/fsdiff/treewalk.py (5 hunks)
  • tests/fsdiff/test_engine.py (1 hunks)
  • tests/fsdiff/test_filetypes.py (3 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
📚 Learning: 2025-12-02T16:08:02.588Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-02T16:08:02.588Z
Learning: In the snapshotmanager/snapm codebase, `FileTypeInfo` in `snapm/_fsdiff/filetypes.py` is not suitable for conversion to `dataclass` because it requires custom `__init__` logic to compute `is_text_like` from the `category` parameter, and has a custom `__str__` method that formats field names as human-readable strings (e.g., "MIME type:"). The project uses `dataclass` extensively for classes like `SnapmConfig` and `GcPolicyParams*`, but only where it doesn't interfere with custom initialization or formatting requirements.

Applied to files:

  • snapm/fsdiff/options.py
  • snapm/fsdiff/engine.py
  • snapm/fsdiff/filetypes.py
  • tests/fsdiff/test_filetypes.py
📚 Learning: 2025-12-14T12:52:14.459Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 707
File: snapm/fsdiff/engine.py:327-405
Timestamp: 2025-12-14T12:52:14.459Z
Learning: In snapm/fsdiff/engine.py, render_unified_diff intentionally uses tc.WHITE instead of tc.NORMAL to reset colors after diff lines. This avoids breaking less -R output when piping through files, and is intended to support --color=always with less -R. When reviewing changes in this file, ensure any color reset logic preserves compatibility with downstream pagers/filters; if you introduce color resets, test with --color=always and piping to less -R. If you modify reset color behavior, consider updating tests and documenting rationale.

Applied to files:

  • snapm/fsdiff/engine.py
📚 Learning: 2025-10-22T20:06:54.993Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.

Applied to files:

  • snapm/fsdiff/filetypes.py
📚 Learning: 2025-12-10T18:28:51.721Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-10T18:28:51.721Z
Learning: In the snapshotmanager/snapm codebase, `TreeWalker.walk_tree()` in `snapm/_fsdiff/treewalk.py` eagerly materializes all paths before processing (using `os.walk`) to compute the total count for progress reporting. This design is intentional and necessary because accurate progress updates require knowing the total upfront. The custom `snapm._progress` implementation is lightweight (~2000 lines with tests) and RHEL-compatible, unlike alternatives like `tqdm` (>8000 lines, not in RHEL). The eager materialization trade-off is well-justified for the UX benefit of progress reporting on large trees.

Applied to files:

  • snapm/fsdiff/treewalk.py
📚 Learning: 2025-09-24T02:49:21.752Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 490
File: snapm/manager/plugins/lvm2.py:618-629
Timestamp: 2025-09-24T02:49:21.752Z
Learning: When API changes affect return types (like changing from int to tuple), all supporting infrastructure needs updating including mock scripts in tests, not just the call sites that consume the API.

Applied to files:

  • tests/fsdiff/test_engine.py
🧬 Code graph analysis (4)
snapm/fsdiff/engine.py (1)
snapm/fsdiff/treewalk.py (2)
  • is_dir (310-318)
  • is_symlink (321-329)
snapm/fsdiff/treewalk.py (1)
snapm/fsdiff/filetypes.py (1)
  • detect_file_type (753-786)
tests/fsdiff/test_engine.py (1)
snapm/fsdiff/engine.py (1)
  • short (650-679)
tests/fsdiff/test_filetypes.py (1)
snapm/fsdiff/filetypes.py (1)
  • detect_file_type (753-786)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
🔇 Additional comments (7)
snapm/fsdiff/engine.py (1)

82-101: File type description plumbing in FsDiffRecord/FsDiffResults looks sound

file_type_desc is derived once in __init__ with sensible fallbacks (directory, symbolic link, description from file_type_info, else "unknown"), is included consistently in __str__, to_dict(), and the short() formatter, and does not alter existing control flow for diffs or summaries. This gives more human-friendly output without changing diff semantics.

Also applies to: 157-183, 186-229, 243-260, 261-278, 315-337, 650-679
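
The fallback chain for file_type_desc might look something like the sketch below. The function name and signature are hypothetical, and the order in which the directory and symlink checks run is an assumption; the four result strings come from the review text.

```python
def describe_file_type(is_dir, is_symlink, description=None):
    """Return a human-readable description for one entry.

    Assumed order: symlink, then directory, then the detector's own
    description, then a generic placeholder.
    """
    if is_symlink:
        return "symbolic link"
    if is_dir:
        return "filesystem directory"
    # description stands in for file_type_info.description, which may be
    # missing or empty when detection produced nothing useful.
    return description or "unknown"
```

Computing the value once and storing it means __str__, to_dict(), and short() can all report the same string without repeating the checks.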

tests/fsdiff/test_engine.py (1)

72-81: Tests correctly track updated diff output fields

The short/full output expectations now align with the new labels (diff_type and content_diff_summary) and exercise the modified formatting without over-constraining surrounding text.

scripts/difftest.py (1)

27-35: CLI wiring for -f/--file-types matches new DiffOptions flag

The argparse destination now matches DiffOptions.use_magic_file_type, so difftest.py will correctly toggle magic-based detection without changing the user-visible option.

snapm/fsdiff/treewalk.py (2)

55-105: Hard excludes are applied safely and cannot be bypassed

_ALWAYS_EXCLUDE_PATTERNS covers the right set of pathological /proc, /dev, and /sys entries, and the fnmatch check against stripped_pathname happens before any stat() or content access. Combining these with self.exclude_patterns means they are enforced even when include_system_dirs=True, while still honouring user excludes elsewhere. This is a good balance between safety and configurability.

Also applies to: 580-593


450-498: Always computing file_type_info aligns TreeWalker with the new reporting model

_process_dir and _process_file now always call detect_file_type (magic vs guess controlled by options.use_magic_file_type), which ensures every FsEntry has file_type_info filled in for use by FsDiffRecord (file_type/file_type_desc and category). This slightly increases per-entry work but keeps magic usage fully opt-in and uses lightweight name-based guessing otherwise, so it should be acceptable for large walks.

Also applies to: 621-643

snapm/command.py (1)

2698-2721: CLI option now correctly targets use_magic_file_type

The -f/--file-types option for snapm snapset diff and diffreport now properly populates use_magic_file_type, aligning with the DiffOptions field and FileTypeDetector.detect_file_type method. The migration is complete with no remaining references to the previous field name.

snapm/fsdiff/options.py (1)

31-45: DiffOptions flag rename is clean and complete; no stale references detected

The use_magic_file_type field integrates cleanly with from_cmd_args, which pulls values by dataclass field names. Verification confirms the argparse destination at line 2718 of snapm/command.py correctly uses dest="use_magic_file_type", matching the dataclass field exactly. No remaining references to the old include_file_type name exist anywhere in the codebase.
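
To see why the argparse dest has to match the dataclass field name, consider the following sketch. It is a simplified stand-in and not the real DiffOptions: the two fields, the store_true action, and the body of from_cmd_args are assumptions made for illustration.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class DiffOptions:
    use_magic_file_type: bool = False
    include_system_dirs: bool = False

    @classmethod
    def from_cmd_args(cls, cmd_args):
        """Copy every attribute of cmd_args whose name matches a field."""
        values = {
            f.name: getattr(cmd_args, f.name)
            for f in fields(cls)
            if hasattr(cmd_args, f.name)
        }
        return cls(**values)


parser = argparse.ArgumentParser()
parser.add_argument(
    "-f", "--file-types", dest="use_magic_file_type", action="store_true"
)
options = DiffOptions.from_cmd_args(parser.parse_args(["-f"]))
```

If dest still named the old field, the namespace attribute would not match any dataclass field and use_magic_file_type would silently keep its default, which is the failure mode the verification above rules out.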

Resolves: #802

Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Resolves: #805

Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
@bmr-cymru bmr-cymru force-pushed the bmr-extended-file-types branch from 122d66e to bb3661c Compare December 21, 2025 00:15

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
snapm/fsdiff/filetypes.py (1)

647-656: Clarify operator precedence in is_text_like expression.

The expression mixing `and` and `or` operators lacks explicit parentheses, making the intended precedence unclear:

self.is_text_like = category in (...) or (
    category == FileTypeCategory.DOCUMENT
    and mime_type.startswith("text/")
    or mime_type in self.TEXT_DOCUMENTS
)

Because `and` binds tighter than `or`, the inner expression is parsed as:

(category == FileTypeCategory.DOCUMENT and mime_type.startswith("text/")) or (mime_type in self.TEXT_DOCUMENTS)

This means any file with a MIME type in TEXT_DOCUMENTS will be marked text-like regardless of its category. If the intent is that TEXT_DOCUMENTS should only apply to DOCUMENT category files, add explicit parentheses.

🔎 Proposed fix if TEXT_DOCUMENTS should be category-agnostic
 self.is_text_like = category in (
     FileTypeCategory.TEXT,
     FileTypeCategory.CONFIG,
     FileTypeCategory.LOG,
     FileTypeCategory.SOURCE_CODE,
-) or (
-    category == FileTypeCategory.DOCUMENT
-    and mime_type.startswith("text/")
-    or mime_type in self.TEXT_DOCUMENTS
-)
+) or (category == FileTypeCategory.DOCUMENT and mime_type.startswith("text/")) \
+  or mime_type in self.TEXT_DOCUMENTS

Or if TEXT_DOCUMENTS should only apply to DOCUMENT category:

 self.is_text_like = category in (
     FileTypeCategory.TEXT,
     FileTypeCategory.CONFIG,
     FileTypeCategory.LOG,
     FileTypeCategory.SOURCE_CODE,
 ) or (
     category == FileTypeCategory.DOCUMENT
-    and mime_type.startswith("text/")
-    or mime_type in self.TEXT_DOCUMENTS
+    and (mime_type.startswith("text/") or mime_type in self.TEXT_DOCUMENTS)
 )
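
The difference between the two readings can be checked with a small standalone sketch. The function names and boolean parameters are hypothetical simplifications of the real expression, which works on category and mime_type values.

```python
def text_like_as_written(is_document, is_text_mime, in_text_documents):
    # Python parses this as (is_document and is_text_mime) or in_text_documents.
    return is_document and is_text_mime or in_text_documents


def text_like_document_only(is_document, is_text_mime, in_text_documents):
    # Explicit parentheses restrict the TEXT_DOCUMENTS check to documents.
    return is_document and (is_text_mime or in_text_documents)
```

The two functions disagree only when the entry is not a DOCUMENT but its MIME type is listed in TEXT_DOCUMENTS, so that is the case worth covering in a test whichever reading is intended.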
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 122d66e and bb3661c.

📒 Files selected for processing (8)
  • scripts/difftest.py (1 hunks)
  • snapm/command.py (1 hunks)
  • snapm/fsdiff/engine.py (5 hunks)
  • snapm/fsdiff/filetypes.py (7 hunks)
  • snapm/fsdiff/options.py (1 hunks)
  • snapm/fsdiff/treewalk.py (5 hunks)
  • tests/fsdiff/test_engine.py (1 hunks)
  • tests/fsdiff/test_filetypes.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/fsdiff/test_engine.py
  • snapm/fsdiff/treewalk.py
  • scripts/difftest.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
📚 Learning: 2025-12-02T16:08:02.588Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-02T16:08:02.588Z
Learning: In the snapshotmanager/snapm codebase, `FileTypeInfo` in `snapm/_fsdiff/filetypes.py` is not suitable for conversion to `dataclass` because it requires custom `__init__` logic to compute `is_text_like` from the `category` parameter, and has a custom `__str__` method that formats field names as human-readable strings (e.g., "MIME type:"). The project uses `dataclass` extensively for classes like `SnapmConfig` and `GcPolicyParams*`, but only where it doesn't interfere with custom initialization or formatting requirements.

Applied to files:

  • snapm/fsdiff/options.py
  • snapm/fsdiff/filetypes.py
  • snapm/fsdiff/engine.py
  • tests/fsdiff/test_filetypes.py
📚 Learning: 2025-10-22T20:06:54.993Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.

Applied to files:

  • snapm/fsdiff/filetypes.py
📚 Learning: 2025-12-14T12:52:14.459Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 707
File: snapm/fsdiff/engine.py:327-405
Timestamp: 2025-12-14T12:52:14.459Z
Learning: In snapm/fsdiff/engine.py, render_unified_diff intentionally uses tc.WHITE instead of tc.NORMAL to reset colors after diff lines. This avoids breaking less -R output when piping through files, and is intended to support --color=always with less -R. When reviewing changes in this file, ensure any color reset logic preserves compatibility with downstream pagers/filters; if you introduce color resets, test with --color=always and piping to less -R. If you modify reset color behavior, consider updating tests and documenting rationale.

Applied to files:

  • snapm/fsdiff/engine.py
🧬 Code graph analysis (2)
snapm/fsdiff/engine.py (1)
snapm/fsdiff/treewalk.py (2)
  • is_dir (310-318)
  • is_symlink (321-329)
tests/fsdiff/test_filetypes.py (1)
snapm/fsdiff/filetypes.py (1)
  • detect_file_type (803-836)
🪛 Ruff (0.14.8)
snapm/fsdiff/filetypes.py

653-654: Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear

Parenthesize the `and` subexpression

(RUF021)
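For context on what RUF021 flags: in Python `and` binds more tightly than `or`, so an unparenthesized chain is grouped in a way that is easy to misread. The snippet below is a minimal sketch with hypothetical flag names; it is not the expression at lines 653-654.

```python
# Hypothetical flags, for illustration only.
is_text = True
has_ext = False
is_small = False

# `and` binds tighter than `or`, so these two expressions are equivalent.
implicit = is_text or has_ext and is_small
explicit = is_text or (has_ext and is_small)

# Grouping the `or` first changes the result.
regrouped = (is_text or has_ext) and is_small
```

Writing the parentheses out, as in `explicit`, keeps the behaviour unchanged while making the intended grouping visible.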


727-727: Dictionary key literal "text/x-tex" repeated

Remove repeated key literal "text/x-tex"

(F601)


761-761: Dictionary key literal "application/x-sh" repeated

Remove repeated key literal "application/x-sh"

(F601)


762-762: Dictionary key literal "application/x-sh" repeated

Remove repeated key literal "application/x-sh"

(F601)


767-767: Dictionary key literal "application/x-csh" repeated

Remove repeated key literal "application/x-csh"

(F601)


769-769: Dictionary key literal "application/x-bat" repeated

Remove repeated key literal "application/x-bat"

(F601)


771-771: Dictionary key literal "text/x-powershell" repeated

Remove repeated key literal "text/x-powershell"

(F601)


772-772: Dictionary key literal "text/x-powershell" repeated

Remove repeated key literal "text/x-powershell"

(F601)


776-776: Dictionary key literal "text/x-perl" repeated

Remove repeated key literal "text/x-perl"

(F601)


777-777: Dictionary key literal "text/x-perl" repeated

Remove repeated key literal "text/x-perl"

(F601)


781-781: Dictionary key literal "text/x-awk" repeated

Remove repeated key literal "text/x-awk"

(F601)
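The F601 findings matter because a repeated key in a dict literal is not an error at runtime: the last value silently wins. A minimal sketch, using a hypothetical mapping whose category strings are illustrative and not taken from filetypes.py:

```python
# Hypothetical mapping; the category strings are illustrative only.
mime_categories = {
    "application/x-sh": "script",
    "text/x-perl": "script",
    "application/x-sh": "source_code",  # repeated key: replaces the first entry
}

# Only two keys survive, and the later value is the one that is kept.
surviving_keys = sorted(mime_categories)
```

If the repeated entries in the real map carry different values, the earlier ones are dead; if they carry the same value, they are harmless but can simply be dropped.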

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (33)
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:centos-stream-10-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: virt_tests (bios, lvm, fedora42)
  • GitHub Check: virt_tests (uefi, lvm, centos-stream10)
  • GitHub Check: virt_tests (bios, lvm, centos-stream10)
  • GitHub Check: virt_tests (bios, lvm, centos-stream9)
  • GitHub Check: virt_tests (uefi, lvm-thin, fedora42)
  • GitHub Check: virt_tests (uefi, lvm, centos-stream9)
  • GitHub Check: virt_tests (bios, lvm-thin, fedora41)
  • GitHub Check: virt_tests (bios, lvm-thin, centos-stream9)
  • GitHub Check: virt_tests (bios, lvm-thin, centos-stream10)
  • GitHub Check: virt_tests (bios, lvm, fedora41)
  • GitHub Check: virt_tests (bios, lvm-thin, fedora42)
  • GitHub Check: virt_tests (uefi, lvm-thin, centos-stream10)
  • GitHub Check: virt_tests (uefi, lvm-thin, centos-stream9)
  • GitHub Check: virt_tests (uefi, lvm, fedora42)
  • GitHub Check: virt_tests (uefi, lvm, fedora41)
🔇 Additional comments (9)
snapm/command.py (1)

2718-2718: LGTM - CLI argument destination renamed correctly.

The destination rename from include_file_type to use_magic_file_type accurately reflects that libmagic usage is now optional, whilst preserving the existing -f/--file-types CLI flags.

snapm/fsdiff/options.py (1)

44-44: LGTM - DiffOptions field renamed consistently.

The field rename from include_file_type to use_magic_file_type aligns with the CLI changes and accurately describes the optional nature of libmagic-based detection.

tests/fsdiff/test_filetypes.py (1)

34-34: LGTM - Test updates align with new API.

The addition of use_magic=True to the detect_file_type() calls correctly exercises the libmagic-based detection path, ensuring tests validate the magic-enabled workflow.

Also applies to: 47-47, 75-75

snapm/fsdiff/engine.py (2)

86-86: LGTM - file_type_desc field integration looks solid.

The new file_type_desc field and _get_file_type_desc() method provide user-friendly file type descriptions. The implementation correctly handles directories, symlinks, and falls back to magic/guessed descriptions, with consistent integration across __str__(), to_dict(), and reporting outputs.

Also applies to: 168-168, 199-199, 261-277
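As a rough illustration of the description fallback discussed here, the sketch below derives a description from the stat mode first and only then falls back to a detected or guessed one. The function name and signature are assumptions for the example, not the `_get_file_type_desc()` implementation; the two description strings are the ones quoted in the PR summary.

```python
import stat


def describe_entry(mode: int, detected_desc: str = "") -> str:
    """Return a human-readable description for a filesystem entry."""
    # Directories and symlinks are described from the stat mode alone.
    if stat.S_ISDIR(mode):
        return "filesystem directory"
    if stat.S_ISLNK(mode):
        return "symbolic link"
    # Otherwise use whatever the magic or guessed detection provided.
    return detected_desc or "unknown"
```

For example, `describe_entry(stat.S_IFREG | 0o644, "ASCII text")` passes the detected description through unchanged.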


661-677: LGTM - short() format updated consistently.

The short output now includes diff_type, file_type, and file_type_desc fields, and correctly uses the content_diff_summary key name (aligned with the field name rather than a generic "change_summary").

snapm/fsdiff/filetypes.py (4)

32-253: LGTM - comprehensive file type mappings.

The extensive TEXT_EXTENSION_MAP, TEXT_FILENAME_MAP, BINARY_EXTENSION_MAP, BINARY_FILENAME_MAP, and BINARY_FILE_PATHS provide robust coverage for best-effort file type detection. The use of glob patterns in filename maps (e.g., "*makefile") is appropriate and will be correctly handled by Path.match().

Also applies to: 255-279, 281-294, 296-449, 451-462, 464-482
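One detail worth keeping in mind with `Path.match()`: a single-component pattern is matched against the final path component, and on POSIX the match is case-sensitive. A small sketch, using `PurePosixPath` so it behaves the same on any platform; the pattern list is illustrative rather than the real map contents.

```python
from pathlib import PurePosixPath

# Illustrative patterns only; the real maps live in snapm/fsdiff/filetypes.py.
patterns = ["*makefile", "*.conf"]


def matches_any(path: str) -> bool:
    """Return True if the final path component matches any pattern."""
    p = PurePosixPath(path)
    return any(p.match(pattern) for pattern in patterns)
```

With these patterns `/src/GNUmakefile` matches but `/src/Makefile` does not, so a capitalised variant needs its own pattern if it should be recognised.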


485-518: LGTM - guessing functions well structured.

The helper functions _generic_guess_file(), _guess_text_file(), _guess_binary_file(), and _guess_file() provide a sensible fallback hierarchy. The use of Path.match() for glob patterns is correct, and the logic in _guess_binary_file() to honour text patterns even in binary-heavy directories (line 559) is a good design choice.

Also applies to: 521-534, 537-563, 566-584
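The fallback order described above could be sketched roughly as follows. All names, patterns and directories here are hypothetical stand-ins rather than the real helpers; the point is only that text patterns are consulted before the binary-directory rule.

```python
from pathlib import PurePosixPath

# Hypothetical stand-ins for the real maps.
TEXT_PATTERNS = ["*.conf", "*.txt", "*makefile"]
BINARY_PATTERNS = ["*.ko.xz", "*.so", "*.png"]
BINARY_DIRS = ["/usr/lib/modules"]


def guess_kind(path: str) -> str:
    """Best-effort guess from the path alone: text, binary or unknown."""
    p = PurePosixPath(path)
    # Text patterns win, even under a binary-heavy directory.
    if any(p.match(pat) for pat in TEXT_PATTERNS):
        return "text"
    if any(p.match(pat) for pat in BINARY_PATTERNS):
        return "binary"
    # Anything else below a known binary-heavy directory is assumed binary.
    if any(p.is_relative_to(d) for d in BINARY_DIRS):
        return "binary"
    return "unknown"
```

Checking text patterns first means a README shipped under a modules directory is still treated as text, while unrecognised files there default to binary.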


604-606: LGTM - FileTypeCategory extensions.

The addition of SOURCE_CODE, CERTIFICATE, and SYMLINK categories, along with TEXT_DOCUMENTS classification and systemd unit file handling, appropriately extends the categorisation system.

Also applies to: 619-622, 860-861


803-836: LGTM - use_magic parameter integration.

The updated detect_file_type() signature with use_magic=False parameter correctly routes to either magic-based detection or the new _guess_file_type() fallback, fulfilling the PR objectives for optional libmagic usage.

Also applies to: 870-884
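The routing can be pictured with the sketch below. It is not the snapm code: `detect_mime` is a hypothetical name, the magic branch assumes the python-magic package and its `magic.from_file(path, mime=True)` call, and the standard-library `mimetypes` module stands in for the project's own guesser.

```python
import mimetypes


def detect_mime(path: str, use_magic: bool = False) -> str:
    """Return a MIME type string, using libmagic only when asked to."""
    if use_magic:
        try:
            import magic  # optional dependency (assumed: python-magic)

            return magic.from_file(path, mime=True)
        except (ImportError, AttributeError, OSError):
            pass  # fall through to the name-based guess
    # Name-based guess; stand-in for the project's own fallback.
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"
```

Defaulting `use_magic` to `False` keeps the cheap, name-based path as the norm and makes the libmagic dependency strictly opt-in.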

Resolves: #806

Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
…etector

Resolves: #803

Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Resolves: #807

Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Resolves: #808

Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Resolves: #804

Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Resolves: #809

Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
@bmr-cymru bmr-cymru force-pushed the bmr-extended-file-types branch from bb3661c to e4ee2bf Compare December 21, 2025 02:11
If we don't have any FileTypeInfo for the current old_entry/new_entry
pair we fall back to TextContentDiffer:

    try:
        differ = (
            self.get_differ_for_file(file_type_info)
            if file_type_info is not None
            else TextContentDiffer()
        )
        return differ.generate_diff(old_path, new_path, old_entry, new_entry)

This is the root cause of the memory pressure problems on low-memory
systems that prompted #786, #789, #790, #798, and #800. Those are all
good-to-haves, but the reason we were generating 3GiB RSS on these 4GiB
systems was that without -f / --file-types we attempted to generate text
content diffs for everything that changed, for example hundreds of
xz-compressed kernel modules (an add counts as modified, so each one is
diffed against /dev/null):

Saving cache:  38% [==========------------------] (Saving record 17525)
/usr/lib/modules/6.17.12-300.fc43.x86_64/kernel/drivers/infiniband/core/ib_core.ko.xz
    diff_type: unified
      old_content: ''
      new_content: '�7zXZ\x00\x00\x01i"�6\x04�Ч\x0f...'
      diff_data: <1923 items>
      summary: 0 deletions, 1923 additions
      has_changes: True
      error_message:
Saving cache:  38% [==========------------------] (Saving record 17526)
/usr/lib/modules/6.17.12-300.fc43.x86_64/kernel/drivers/infiniband/core/ib_umad.ko.xz
    diff_type: unified
      old_content: ''
      new_content: '�7zXZ\x00\x00\x01i"�6\x04���...'
      diff_data: <349 items>
      summary: 0 deletions, 349 additions
      has_changes: True
      error_message:
Saving cache:  38% [==========------------------] (Saving record 17527)
/usr/lib/modules/6.17.12-300.fc43.x86_64/kernel/drivers/infiniband/core/ib_uverbs.ko.xz
    diff_type: unified
      old_content: ''
      new_content: '�7zXZ\x00\x00\x01i"�6\x04���...'
      diff_data: <948 items>
      summary: 0 deletions, 948 additions
      has_changes: True
      error_message:

Since we now have magic-less best effort type guessing (#801 / #803)
this should be unreachable dead code, but for belts-and-braces let's
switch to the BinaryContentDiffer as the fallback default: it is safer
since it never attempts to compute actual diffs.
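
A minimal sketch of the safer default follows. The two differ classes are empty stand-ins, `pick_differ` is a hypothetical helper, and the `is_text_like` check is a simplification of whatever get_differ_for_file() really does:

```python
from typing import Optional


class TextContentDiffer:
    """Stand-in: would read both files and compute a line diff."""

    name = "text"


class BinaryContentDiffer:
    """Stand-in: would only report that the content changed."""

    name = "binary"


def pick_differ(file_type_info: Optional[object]):
    # With no type information, prefer the differ that never loads content.
    if file_type_info is None:
        return BinaryContentDiffer()
    # Simplified selection: only text-like files get a text diff.
    if getattr(file_type_info, "is_text_like", False):
        return TextContentDiffer()
    return BinaryContentDiffer()
```

With this default an entry without type information costs one cheap comparison instead of a full in-memory diff of its contents.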

Resolves: #819

Signed-off-by: Bryn M. Reeves <bmr@redhat.com>