-
Notifications
You must be signed in to change notification settings - Fork 4
fsdiff: implement best-effort file type detection and "always exclude" paths #810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughRename include_file_type → use_magic_file_type across CLI and options; add best‑effort non‑magic file‑type guessing and many text/binary patterns; always‑exclude specific filesystem paths during tree walking; add human‑readable file_type_desc to FsDiffRecord and adjust short/full output fields and tests. (48 words) Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related issues
Possibly related PRs
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: Organization UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used🧠 Learnings (1)📓 Common learnings⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (37)
🔇 Additional comments (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Congratulations! One of the builds has completed. 🍾 You can install the built RPMs by following these steps:
Please note that the RPMs should be used only in a testing environment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/fsdiff/test_filetypes.py (2)
70-78: Test no longer exercises the magic error handling path.With the signature change to
detect_file_type(file_path, use_magic=False), this test now defaults to the guessing path rather than the magic path. The mock onmagic.detect_from_filenamewon't be triggered, and the test passes for the wrong reason.🔎 Proposed fix
@unittest.skipIf(not hasattr(magic, "error"), "magic does not have magic.error") @patch("snapm.fsdiff.filetypes.magic.detect_from_filename") def test_detect_file_type_error(self, mock_magic): # Simulate magic library error mock_magic.side_effect = magic.error("magic failed") path = Path("/broken") - info = self.detector.detect_file_type(path) + info = self.detector.detect_file_type(path, use_magic=True) self.assertEqual(info.category, FileTypeCategory.BINARY) self.assertEqual(info.mime_type, "application/octet-stream")
24-36: Other tests using mocked magic also needuse_magic=True.The tests
test_categorize_log_fileandtest_categorize_json_configmockmagic.detect_from_filenamebut calldetect_file_typewithoutuse_magic=True. These tests will use the guessing path instead, making the mocks ineffective.🔎 Proposed fix
@patch("snapm.fsdiff.filetypes.magic.detect_from_filename") def test_categorize_log_file(self, mock_magic): # magic might say plain text, but filename says .log mock_res = MagicMock() mock_res.mime_type = "text/plain" mock_res.name = "ASCII text" mock_res.encoding = "us-ascii" mock_magic.return_value = mock_res path = Path("/var/log/syslog") - info = self.detector.detect_file_type(path) + info = self.detector.detect_file_type(path, use_magic=True) self.assertEqual(info.category, FileTypeCategory.LOG) @patch("snapm.fsdiff.filetypes.magic.detect_from_filename") def test_categorize_json_config(self, mock_magic): mock_res = MagicMock() mock_res.mime_type = "application/json" mock_res.name = "JSON data" mock_res.encoding = "us-ascii" mock_magic.return_value = mock_res path = Path("/etc/app/config.json") - info = self.detector.detect_file_type(path) + info = self.detector.detect_file_type(path, use_magic=True) self.assertEqual(info.category, FileTypeCategory.CONFIG)
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
scripts/difftest.py(1 hunks)snapm/command.py(1 hunks)snapm/fsdiff/engine.py(5 hunks)snapm/fsdiff/filetypes.py(5 hunks)snapm/fsdiff/options.py(1 hunks)snapm/fsdiff/treewalk.py(5 hunks)tests/fsdiff/test_engine.py(1 hunks)tests/fsdiff/test_filetypes.py(1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
📚 Learning: 2025-12-02T16:08:02.588Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-02T16:08:02.588Z
Learning: In the snapshotmanager/snapm codebase, `FileTypeInfo` in `snapm/_fsdiff/filetypes.py` is not suitable for conversion to `dataclass` because it requires custom `__init__` logic to compute `is_text_like` from the `category` parameter, and has a custom `__str__` method that formats field names as human-readable strings (e.g., "MIME type:"). The project uses `dataclass` extensively for classes like `SnapmConfig` and `GcPolicyParams*`, but only where it doesn't interfere with custom initialization or formatting requirements.
Applied to files:
snapm/fsdiff/options.pysnapm/fsdiff/filetypes.pysnapm/fsdiff/engine.pytests/fsdiff/test_filetypes.py
📚 Learning: 2025-12-14T12:52:14.459Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 707
File: snapm/fsdiff/engine.py:327-405
Timestamp: 2025-12-14T12:52:14.459Z
Learning: In snapm/fsdiff/engine.py, render_unified_diff intentionally uses tc.WHITE instead of tc.NORMAL to reset colors after diff lines. This avoids breaking less -R output when piping through files, and is intended to support --color=always with less -R. When reviewing changes in this file, ensure any color reset logic preserves compatibility with downstream pagers/filters; if you introduce color resets, test with --color=always and piping to less -R. If you modify reset color behavior, consider updating tests and documenting rationale.
Applied to files:
snapm/fsdiff/engine.py
📚 Learning: 2025-09-24T02:49:21.752Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 490
File: snapm/manager/plugins/lvm2.py:618-629
Timestamp: 2025-09-24T02:49:21.752Z
Learning: When API changes affect return types (like changing from int to tuple), all supporting infrastructure needs updating including mock scripts in tests, not just the call sites that consume the API.
Applied to files:
tests/fsdiff/test_filetypes.py
📚 Learning: 2025-12-10T18:28:51.721Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-10T18:28:51.721Z
Learning: In the snapshotmanager/snapm codebase, `TreeWalker.walk_tree()` in `snapm/_fsdiff/treewalk.py` eagerly materializes all paths before processing (using `os.walk`) to compute the total count for progress reporting. This design is intentional and necessary because accurate progress updates require knowing the total upfront. The custom `snapm._progress` implementation is lightweight (~2000 lines with tests) and RHEL-compatible, unlike alternatives like `tqdm` (>8000 lines, not in RHEL). The eager materialization trade-off is well-justified for the UX benefit of progress reporting on large trees.
Applied to files:
snapm/fsdiff/treewalk.py
🧬 Code graph analysis (3)
snapm/fsdiff/engine.py (1)
snapm/fsdiff/treewalk.py (2)
is_dir(310-318)is_symlink(321-329)
snapm/fsdiff/treewalk.py (1)
snapm/fsdiff/filetypes.py (1)
detect_file_type(738-771)
tests/fsdiff/test_engine.py (1)
snapm/fsdiff/engine.py (1)
short(650-679)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (34)
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: virt_tests (uefi, lvm-thin, fedora42)
- GitHub Check: virt_tests (bios, lvm-thin, fedora42)
- GitHub Check: virt_tests (uefi, lvm, fedora41)
- GitHub Check: virt_tests (uefi, lvm, centos-stream10)
- GitHub Check: virt_tests (uefi, lvm, centos-stream9)
- GitHub Check: virt_tests (uefi, lvm-thin, fedora41)
- GitHub Check: virt_tests (bios, lvm-thin, centos-stream9)
- GitHub Check: virt_tests (uefi, lvm-thin, centos-stream9)
- GitHub Check: virt_tests (bios, lvm, centos-stream10)
- GitHub Check: virt_tests (uefi, lvm-thin, centos-stream10)
- GitHub Check: virt_tests (bios, lvm, centos-stream9)
- GitHub Check: virt_tests (bios, lvm-thin, fedora41)
- GitHub Check: virt_tests (bios, lvm-thin, centos-stream10)
- GitHub Check: virt_tests (uefi, lvm, fedora42)
- GitHub Check: virt_tests (bios, lvm, fedora41)
- GitHub Check: python-basic-tests
🔇 Additional comments (19)
snapm/fsdiff/filetypes.py (4)
32-242: LGTM! Comprehensive text extension mapping.The extensive mapping covers a wide range of text-based file types including documentation, configuration, web standards, scripting, source code, and more. The consistent format with MIME types and lowercase descriptions is well-structured.
579-581: LGTM! New FileTypeCategory enum variants.The new
SOURCE_CODE,CERTIFICATE, andSYMLINKcategories align with the PR objectives and provide more granular file type categorisation.
738-771: LGTM! Well-structured dual-path detection.The
detect_file_typemethod now cleanly handles both magic-based and guessing-based detection paths. Error handling for the magic path is robust with platform-specific exception handling for c9s compatibility.
801-815: LGTM! Clean delegation to guessing helper.The
_guess_file_typemethod properly wraps the module-level_guess_filefunction and integrates with the existing categorisation logic.scripts/difftest.py (1)
28-34: LGTM! Consistent rename of argument destination.The
destparameter rename frominclude_file_typetouse_magic_file_typealigns with the corresponding change inDiffOptionsand ensuresfrom_cmd_argscorrectly maps the CLI flag to the dataclass field.tests/fsdiff/test_engine.py (2)
73-76: LGTM! Test assertions updated to match new output format.The assertions now correctly expect the updated output field names (
diff_type:andcontent_diff_summary:) as reflected in theFsDiffResults.short()method shown in the relevant code snippet.
245-253: LGTM! FsDiffRecord string representation test updated.The test correctly reflects the updated
__str__output format withdiff_type: modifiedandmetadata_changed:fields.snapm/fsdiff/options.py (1)
43-44: LGTM! Option renamed to clarify magic-based detection.The rename from
include_file_typetouse_magic_file_typebetter communicates the behaviour - whenTrue, the libmagic library is used for detection; whenFalse(default), the best-effort guessing path is used. This aligns with PR objective #809.snapm/command.py (1)
2718-2718: LGTM! Clean rename aligning with updated option naming.The destination variable rename from
include_file_typetouse_magic_file_typecorrectly reflects the updated naming convention whilst preserving the user-facing-f/--file-typesoption unchanged.snapm/fsdiff/treewalk.py (5)
55-94: Excellent comprehensive hazard mitigation.The
_ALWAYS_EXCLUDE_PATTERNSconstant provides thorough coverage of dangerous system paths with clear categorisation and inline documentation. The patterns correctly address known hazards including:
- Blocking streams (/proc/kmsg, /dev/console)
- Infinite data sources (/dev/zero, /dev/random)
- System memory access (/proc/kcore, /dev/mem)
- Hardware triggers (/dev/watchdog*)
The glob patterns are appropriate for use with
fnmatch.
243-243: LGTM! Improved output formatting.The formatting change to print the stat block on a new line enhances readability of the
FsEntrystring representation.
463-465: LGTM! Consistent file type detection for directories.File type detection is now always invoked for directories with the configurable
use_magicflag, enabling best-effort type guessing when libmagic is unavailable. This aligns with PR objectives #801 and #803.
482-484: LGTM! Consistent file type detection for regular files.File type detection is now always invoked for regular files with the configurable
use_magicflag, mirroring the directory handling and ensuring consistent type information throughout the tree walk.
581-586: LGTM! Correct implementation of always-exclude logic.The combination of
_ALWAYS_EXCLUDE_PATTERNSwith user-provided exclusion patterns correctly implements the "hard exclude" behaviour that cannot be bypassed by--include-system-dirs. The patterns are properly applied usingfnmatchto filter paths during tree traversal.snapm/fsdiff/engine.py (5)
86-86: LGTM! Consistent field initialization pattern.The
file_type_descfield initialization follows the established pattern of using a helper method, maintaining consistency withfile_typeandfile_categoryinitialization.
261-278: LGTM! Well-structured file type description logic.The
_get_file_type_desc()method provides clear human-readable descriptions with appropriate fallback behaviour:
- Special handling for directories and symbolic links
- Leverages
file_type_info.descriptionwhen available- Graceful fallback to "unknown" for edge cases
The descriptions enhance user-facing output quality as intended by PR objective #808.
168-168: LGTM! Logical field placement in string representation.The
file_type_descfield is appropriately positioned afterfile_typein the string representation, maintaining logical grouping of related information.
199-199: LGTM! Complete API surface coverage.The
file_type_descfield is correctly included in the dictionary representation, ensuring the field is available in JSON output and maintaining consistency with the string representation.
661-677: LGTM! Improved output consistency and completeness.The updates to
short()method enhance output quality:
- Field labels now match actual attribute names (
content_diff_summary,changes)- Includes the new
file_type_descfield for richer type information- Maintains clean, readable formatting
These changes align with PR objective #808 for consistent full/short formatting.
516e7b5 to
2aa4499
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (3)
snapm/fsdiff/filetypes.py (3)
475-508: Typo in docstring.Line 492 contains a typo: "desciption" should be "description".
🔎 Proposed fix
- :returns: A 3-tuple containing (mime_type, desciption, encoding) if the + :returns: A 3-tuple containing (mime_type, description, encoding) if the type could be guessed or ``None`` otherwise.
556-574: Typo in docstring.Line 563 contains a typo: "desciption" should be "description".
🔎 Proposed fix
- :returns: A 3-tuple containing (mime_type, desciption, encoding). + :returns: A 3-tuple containing (mime_type, description, encoding).
511-524: Typo in docstring.Line 518 contains a typo: "desciption" should be "description".
🔎 Proposed fix
- :returns: A 3-tuple containing (mime_type, desciption, encoding) if the + :returns: A 3-tuple containing (mime_type, description, encoding) if the type could be guessed or ``None`` otherwise.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
scripts/difftest.py(1 hunks)snapm/command.py(1 hunks)snapm/fsdiff/engine.py(5 hunks)snapm/fsdiff/filetypes.py(5 hunks)snapm/fsdiff/options.py(1 hunks)snapm/fsdiff/treewalk.py(5 hunks)tests/fsdiff/test_engine.py(1 hunks)tests/fsdiff/test_filetypes.py(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- snapm/command.py
- scripts/difftest.py
- tests/fsdiff/test_engine.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
📚 Learning: 2025-12-10T18:28:51.721Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-10T18:28:51.721Z
Learning: In the snapshotmanager/snapm codebase, `TreeWalker.walk_tree()` in `snapm/_fsdiff/treewalk.py` eagerly materializes all paths before processing (using `os.walk`) to compute the total count for progress reporting. This design is intentional and necessary because accurate progress updates require knowing the total upfront. The custom `snapm._progress` implementation is lightweight (~2000 lines with tests) and RHEL-compatible, unlike alternatives like `tqdm` (>8000 lines, not in RHEL). The eager materialization trade-off is well-justified for the UX benefit of progress reporting on large trees.
Applied to files:
snapm/fsdiff/treewalk.py
📚 Learning: 2025-12-02T16:08:02.588Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-02T16:08:02.588Z
Learning: In the snapshotmanager/snapm codebase, `FileTypeInfo` in `snapm/_fsdiff/filetypes.py` is not suitable for conversion to `dataclass` because it requires custom `__init__` logic to compute `is_text_like` from the `category` parameter, and has a custom `__str__` method that formats field names as human-readable strings (e.g., "MIME type:"). The project uses `dataclass` extensively for classes like `SnapmConfig` and `GcPolicyParams*`, but only where it doesn't interfere with custom initialization or formatting requirements.
Applied to files:
snapm/fsdiff/options.pysnapm/fsdiff/filetypes.pysnapm/fsdiff/engine.py
📚 Learning: 2025-12-14T12:52:14.459Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 707
File: snapm/fsdiff/engine.py:327-405
Timestamp: 2025-12-14T12:52:14.459Z
Learning: In snapm/fsdiff/engine.py, render_unified_diff intentionally uses tc.WHITE instead of tc.NORMAL to reset colors after diff lines. This avoids breaking less -R output when piping through files, and is intended to support --color=always with less -R. When reviewing changes in this file, ensure any color reset logic preserves compatibility with downstream pagers/filters; if you introduce color resets, test with --color=always and piping to less -R. If you modify reset color behavior, consider updating tests and documenting rationale.
Applied to files:
snapm/fsdiff/engine.py
🧬 Code graph analysis (3)
snapm/fsdiff/treewalk.py (1)
snapm/fsdiff/filetypes.py (1)
detect_file_type(753-786)
snapm/fsdiff/engine.py (1)
snapm/fsdiff/treewalk.py (2)
is_dir(310-318)is_symlink(321-329)
tests/fsdiff/test_filetypes.py (1)
snapm/fsdiff/filetypes.py (1)
detect_file_type(753-786)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (39)
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: virt_tests (bios, lvm-thin, fedora41)
- GitHub Check: virt_tests (bios, lvm-thin, fedora42)
- GitHub Check: virt_tests (bios, lvm-thin, centos-stream10)
- GitHub Check: virt_tests (uefi, lvm, fedora42)
- GitHub Check: virt_tests (bios, lvm, centos-stream10)
- GitHub Check: virt_tests (uefi, lvm, fedora41)
- GitHub Check: virt_tests (uefi, lvm-thin, fedora42)
- GitHub Check: virt_tests (uefi, lvm-thin, centos-stream9)
- GitHub Check: virt_tests (bios, lvm, fedora41)
- GitHub Check: virt_tests (uefi, lvm, centos-stream9)
- GitHub Check: virt_tests (bios, lvm, centos-stream9)
- GitHub Check: virt_tests (bios, lvm, fedora42)
- GitHub Check: virt_tests (uefi, lvm-thin, centos-stream10)
- GitHub Check: virt_tests (bios, lvm-thin, centos-stream9)
- GitHub Check: virt_tests (uefi, lvm, centos-stream10)
- GitHub Check: python-basic-tests
🔇 Additional comments (14)
tests/fsdiff/test_filetypes.py (1)
34-34: Tests correctly updated for new API.The test call sites properly pass
use_magic=Trueto exercise the magic-based file type detection path, aligning with the new parameter introduced indetect_file_type.Also applies to: 47-47, 75-75
snapm/fsdiff/options.py (1)
44-44: Field rename correctly implemented.The rename from
include_file_typetouse_magic_file_typeclearly reflects the distinction between magic-based and non-magic file type detection, improving API clarity.snapm/fsdiff/treewalk.py (3)
55-94: Excellent hard exclusion patterns.The
_ALWAYS_EXCLUDE_PATTERNScomprehensively covers hazardous filesystem paths that should never be traversed, including blocking streams, watchdogs, and debug interfaces. The categorization and inline comments are helpful.
463-465: File type detection consistently updated.Both
_process_dirand_process_filenow unconditionally compute file type information, passing theuse_magicparameter from options. This ensures consistent behaviour across the codebase.Also applies to: 482-484
581-586: Exclusion logic correctly implements hard exclusions.The combination of
_ALWAYS_EXCLUDE_PATTERNSwithself.exclude_patternsensures that dangerous paths are always excluded, regardless of user configuration. Thefnmatchcheck correctly applies both sets of patterns.snapm/fsdiff/filetypes.py (6)
32-472: Comprehensive file type mappings added.The extensive text and binary file type mappings provide excellent coverage for best-effort file type detection without libmagic. The organization by category (general text, web standards, source code, archives, media, etc.) makes the mappings maintainable.
594-596: New enum variants correctly added.The
SOURCE_CODE,CERTIFICATE, andSYMLINKcategory additions provide better file type granularity as specified in issue #805.
658-751: Excellent expansion of MIME type categorization.The expanded
category_rulesprovide comprehensive coverage of MIME types across archives, executables, documents, configurations, databases, source code, and certificates. This significantly improves categorization accuracy.
753-786: Dual detection paths correctly implemented.The
use_magicparameter enables both magic-based and best-effort non-magic file type detection. The magic path preserves existing error handling, while the non-magic path provides a fallback. The default ofFalseencourages non-magic detection as intended.
809-811: Systemd unit file detection correctly implemented.The special handling for systemd unit files outside
/etcensures they are correctly categorized asCONFIGbased on their extension.
820-834: Non-magic detection method correctly implemented.The
_guess_file_typemethod properly integrates the non-magic guessing helpers with the existing categorization logic, returning a completeFileTypeInfostructure.snapm/fsdiff/engine.py (3)
86-86: File type description field correctly implemented.The
file_type_descfield and_get_file_type_desc()helper provide human-readable file type descriptions, with appropriate special cases for directories and symlinks, and sensible fallbacks. This enhances reporting clarity as specified in issue #808.Also applies to: 261-277
168-168: File type description properly exposed in outputs.The
file_type_descfield is correctly included in both the string representation and dictionary serialization, maintaining consistency with other reporting fields.Also applies to: 199-199
661-676: Short format consistently updated with new fields.The updated field labels (
content_diff_summary,changes) and inclusion offile_type_descimprove output clarity and consistency as specified in issue #808.
2aa4499 to
d4be023
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/fsdiff/test_filetypes.py (1)
80-83: Consider adding tests for the non-magic guessing path.The tests cover the
use_magic=Truepath, but there appear to be no tests for the new_guess_file_type()fallback path (whenuse_magic=False). Consider adding tests that verify extension-based and path-based guessing works correctly.Would you like me to open an issue to track adding test coverage for the non-magic file type guessing functionality?
snapm/fsdiff/filetypes.py (2)
285-439: Note:.tsextension conflict with TypeScript.The
.tsextension at line 371 is mapped to"video/mp2t"(MPEG Transport Stream), but it's also commonly used for TypeScript source files (mapped inTEXT_EXTENSION_MAPat line 92). Since_guess_filechecks binary extensions first, TypeScript files would be misidentified as video files.This is an inherent ambiguity with the
.tsextension. For a best-effort guesser, you might consider:
- Removing
.tsfromBINARY_EXTENSION_MAP(TypeScript is more common in development contexts)- Adding path-based heuristics (e.g., files near
package.jsonare likely TypeScript)- Documenting this limitation
820-834: Add return type annotation to function signature.The docstring correctly specifies `:rtype: ``FileTypeInfo```, but the function signature is missing the return type annotation for consistency with the rest of the codebase.
🔎 Proposed fix
- def _guess_file_type(self, file_path: Path): + def _guess_file_type(self, file_path: Path) -> FileTypeInfo:
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
scripts/difftest.py(1 hunks)snapm/command.py(1 hunks)snapm/fsdiff/engine.py(5 hunks)snapm/fsdiff/filetypes.py(5 hunks)snapm/fsdiff/options.py(1 hunks)snapm/fsdiff/treewalk.py(5 hunks)tests/fsdiff/test_engine.py(1 hunks)tests/fsdiff/test_filetypes.py(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/fsdiff/test_engine.py
- snapm/fsdiff/treewalk.py
- snapm/command.py
- scripts/difftest.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
📚 Learning: 2025-12-02T16:08:02.588Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-02T16:08:02.588Z
Learning: In the snapshotmanager/snapm codebase, `FileTypeInfo` in `snapm/_fsdiff/filetypes.py` is not suitable for conversion to `dataclass` because it requires custom `__init__` logic to compute `is_text_like` from the `category` parameter, and has a custom `__str__` method that formats field names as human-readable strings (e.g., "MIME type:"). The project uses `dataclass` extensively for classes like `SnapmConfig` and `GcPolicyParams*`, but only where it doesn't interfere with custom initialization or formatting requirements.
Applied to files:
snapm/fsdiff/options.pysnapm/fsdiff/filetypes.pysnapm/fsdiff/engine.py
📚 Learning: 2025-10-22T20:06:54.993Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
Applied to files:
snapm/fsdiff/filetypes.py
📚 Learning: 2025-12-14T12:52:14.459Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 707
File: snapm/fsdiff/engine.py:327-405
Timestamp: 2025-12-14T12:52:14.459Z
Learning: In snapm/fsdiff/engine.py, render_unified_diff intentionally uses tc.WHITE instead of tc.NORMAL to reset colors after diff lines. This avoids breaking less -R output when piping through files, and is intended to support --color=always with less -R. When reviewing changes in this file, ensure any color reset logic preserves compatibility with downstream pagers/filters; if you introduce color resets, test with --color=always and piping to less -R. If you modify reset color behavior, consider updating tests and documenting rationale.
Applied to files:
snapm/fsdiff/engine.py
🧬 Code graph analysis (2)
tests/fsdiff/test_filetypes.py (1)
snapm/fsdiff/filetypes.py (1)
detect_file_type(753-786)
snapm/fsdiff/engine.py (1)
snapm/fsdiff/treewalk.py (2)
is_dir(310-318)is_symlink(321-329)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: virt_tests (uefi, lvm-thin, fedora41)
- GitHub Check: virt_tests (bios, lvm, centos-stream10)
- GitHub Check: virt_tests (bios, lvm-thin, centos-stream9)
- GitHub Check: virt_tests (uefi, lvm, centos-stream9)
- GitHub Check: virt_tests (bios, lvm-thin, fedora42)
- GitHub Check: virt_tests (uefi, lvm-thin, fedora42)
- GitHub Check: python-basic-tests
🔇 Additional comments (17)
snapm/fsdiff/options.py (1)
43-44: LGTM!The field rename from
include_file_typetouse_magic_file_typeaccurately reflects the new semantics whereFalsetriggers best-effort guessing andTrueuses libmagic. The comment clearly describes the behaviour.tests/fsdiff/test_filetypes.py (3)
24-36: LGTM!The test correctly passes
use_magic=Trueto exercise the magic-based detection path, which aligns with the updated API where the default is nowFalse. The mock setup and assertions remain valid.
38-49: LGTM!Consistent update to use
use_magic=Truefor the JSON config categorisation test.
67-78: LGTM!Error handling test correctly updated to use
use_magic=Truesince the error path only applies when libmagic is used.snapm/fsdiff/engine.py (4)
86-86: LGTM!The new
file_type_descfield is initialised correctly alongside the existingfile_typefield, providing a human-readable description for reporting.
261-277: LGTM!The
_get_file_type_descmethod follows the same pattern as_get_file_typeand correctly returns descriptive strings for directories ("filesystem directory"), symlinks ("symbolic link"), and falls back tofile_type_info.descriptionwhen available.
186-199: LGTM!The
to_dictmethod correctly includes the newfile_type_descfield, maintaining consistency with__str__output.
650-679: LGTM!The
short()method output is now more consistent withfull()by using matching field labels (content_diff_summary,changes,diff_type,file_type,file_type_desc). This addresses objective #808 for consistent full/short formatting.snapm/fsdiff/filetypes.py (9)
32-242: LGTM!The
TEXT_EXTENSION_MAPprovides comprehensive coverage for text-based file types including source code, configuration files, documentation, and web standards. The format is consistent with 2-tuples of (mime_type, description).
244-268: LGTM!The
TEXT_FILENAME_MAPhandles common extensionless files correctly, including build files, release files, and documentation.
270-283: LGTM!The
SYSTEMD_UNIT_EXTENSIONStuple provides a clear list of systemd unit file extensions for special handling in_categorize_file.
475-508: LGTM!The
_generic_guess_filefunction is well-designed with clear parameters for extension and filename maps. The docstring correctly documents the 3-tuple return type.
545-553: LGTM!The logic correctly handles the edge case where text files (like shell scripts or config files) exist in binary directories. Converting
abs_parent_pathto string before the dictionary lookup addresses the type mismatch issue flagged in earlier reviews.
556-574: LGTM!The
_guess_filefunction provides a sensible fallback chain: binary guess → text guess → default octet-stream. This ensures a valid 3-tuple is always returned.
594-596: LGTM!The new
SOURCE_CODE,CERTIFICATE, andSYMLINKenum values align with objective #805 and are properly integrated intocategory_rules.
658-751: LGTM!The expanded
category_rulesprovides comprehensive MIME-type categorisation. The# fmt: off/oncomments are appropriate for maintaining readability of the dictionary structure.
753-786: LGTM!The
detect_file_typemethod correctly implements the dual-path detection:
use_magic=True: Uses libmagic with proper error handling for platform variations (c9s magic.error)use_magic=False: Falls back to_guess_file_typeThe default
use_magic=Falsealigns with the PR objective to provide best-effort detection without requiring libmagic.
236d899 to
122d66e
Compare
|
@coderabbitai: full review |
|
@bmr-cymru: I'll conduct a full review of all the changes in this PR. 🧠 Learnings used✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/fsdiff/test_filetypes.py (1)
24-37: Tests now explicitly exercise the magic-based detection pathThe updated calls to
detect_file_type(..., use_magic=True)keep these tests focused on the libmagic code path, and the error-handling test remains guarded correctly byhasattr(magic, "error"). You may want a small additional test for the non-magic guesser in future, but it is not required for this PR.Also applies to: 38-50, 68-79
snapm/fsdiff/filetypes.py (1)
577-602: Source-code categorisation and text-likeness heuristics could be sharpenedThe new
SOURCE_CODE,CERTIFICATE, andSYMLINKcategories and the expandedcategory_rulesare a nice improvement, but two aspects slightly undermine their effect:
- In
category_rules, the generic"text/"prefix rule appears before the more specifictext/x-*source-code entries. Because you iterate in insertion order and usemime_type.startswith(pattern), anytext/x-python,text/x-c, etc. MIME will be classified asTEXTbefore theSOURCE_CODErules are considered.FileTypeInfo.is_text_likeonly treatsTEXT,CONFIG, andLOGas text-like. Any MIME now classified asSOURCE_CODE(e.g.application/javascriptor similar) will be considered non-text, which may affect any callers relying onis_text_like(for example, to decide when to generate unified vs binary diffs).If you want
SOURCE_CODEto behave as a distinct but still text-like category, you could:
- Move the generic
"text/"rule to the end ofcategory_rules, after the more specifictext/x-*andtext/html/text/cssentries, or split rules into an "exact" vs "prefix" set and check exact matches first.- Include
FileTypeCategory.SOURCE_CODE(and possiblyDOCUMENT) in theis_text_likeset.This is behavioural polish rather than a blocker, so it can be deferred, but it will make the new categories more effective.
Also applies to: 658-751, 788-819, 820-834
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
scripts/difftest.py(1 hunks)snapm/command.py(1 hunks)snapm/fsdiff/engine.py(5 hunks)snapm/fsdiff/filetypes.py(5 hunks)snapm/fsdiff/options.py(1 hunks)snapm/fsdiff/treewalk.py(5 hunks)tests/fsdiff/test_engine.py(1 hunks)tests/fsdiff/test_filetypes.py(3 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
📚 Learning: 2025-12-02T16:08:02.588Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-02T16:08:02.588Z
Learning: In the snapshotmanager/snapm codebase, `FileTypeInfo` in `snapm/_fsdiff/filetypes.py` is not suitable for conversion to `dataclass` because it requires custom `__init__` logic to compute `is_text_like` from the `category` parameter, and has a custom `__str__` method that formats field names as human-readable strings (e.g., "MIME type:"). The project uses `dataclass` extensively for classes like `SnapmConfig` and `GcPolicyParams*`, but only where it doesn't interfere with custom initialization or formatting requirements.
Applied to files:
snapm/fsdiff/options.pysnapm/fsdiff/engine.pysnapm/fsdiff/filetypes.pytests/fsdiff/test_filetypes.py
📚 Learning: 2025-12-14T12:52:14.459Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 707
File: snapm/fsdiff/engine.py:327-405
Timestamp: 2025-12-14T12:52:14.459Z
Learning: In snapm/fsdiff/engine.py, render_unified_diff intentionally uses tc.WHITE instead of tc.NORMAL to reset colors after diff lines. This avoids breaking less -R output when piping through files, and is intended to support --color=always with less -R. When reviewing changes in this file, ensure any color reset logic preserves compatibility with downstream pagers/filters; if you introduce color resets, test with --color=always and piping to less -R. If you modify reset color behavior, consider updating tests and documenting rationale.
Applied to files:
snapm/fsdiff/engine.py
📚 Learning: 2025-10-22T20:06:54.993Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
Applied to files:
snapm/fsdiff/filetypes.py
📚 Learning: 2025-12-10T18:28:51.721Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-10T18:28:51.721Z
Learning: In the snapshotmanager/snapm codebase, `TreeWalker.walk_tree()` in `snapm/_fsdiff/treewalk.py` eagerly materializes all paths before processing (using `os.walk`) to compute the total count for progress reporting. This design is intentional and necessary because accurate progress updates require knowing the total upfront. The custom `snapm._progress` implementation is lightweight (~2000 lines with tests) and RHEL-compatible, unlike alternatives like `tqdm` (>8000 lines, not in RHEL). The eager materialization trade-off is well-justified for the UX benefit of progress reporting on large trees.
Applied to files:
snapm/fsdiff/treewalk.py
📚 Learning: 2025-09-24T02:49:21.752Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 490
File: snapm/manager/plugins/lvm2.py:618-629
Timestamp: 2025-09-24T02:49:21.752Z
Learning: When API changes affect return types (like changing from int to tuple), all supporting infrastructure needs updating including mock scripts in tests, not just the call sites that consume the API.
Applied to files:
tests/fsdiff/test_engine.py
🧬 Code graph analysis (4)
snapm/fsdiff/engine.py (1)
snapm/fsdiff/treewalk.py (2)
is_dir(310-318)is_symlink(321-329)
snapm/fsdiff/treewalk.py (1)
snapm/fsdiff/filetypes.py (1)
detect_file_type(753-786)
tests/fsdiff/test_engine.py (1)
snapm/fsdiff/engine.py (1)
short(650-679)
tests/fsdiff/test_filetypes.py (1)
snapm/fsdiff/filetypes.py (1)
detect_file_type(753-786)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
🔇 Additional comments (7)
snapm/fsdiff/engine.py (1)
82-101: File type description plumbing in FsDiffRecord/FsDiffResults looks sound
file_type_descis derived once in__init__with sensible fallbacks (directory, symbolic link, description fromfile_type_info, else "unknown"), is included consistently in__str__,to_dict(), and theshort()formatter, and does not alter existing control flow for diffs or summaries. This gives more human-friendly output without changing diff semantics.Also applies to: 157-183, 186-229, 243-260, 261-278, 315-337, 650-679
tests/fsdiff/test_engine.py (1)
72-81: Tests correctly track updated diff output fieldsThe short/full output expectations now align with the new labels (
diff_typeandcontent_diff_summary) and exercise the modified formatting without over-constraining surrounding text.scripts/difftest.py (1)
27-35: CLI wiring for-f/--file-typesmatches new DiffOptions flagThe argparse destination now matches
DiffOptions.use_magic_file_type, sodifftest.pywill correctly toggle magic-based detection without changing the user-visible option.snapm/fsdiff/treewalk.py (2)
55-105: Hard excludes are applied safely and cannot be bypassed
_ALWAYS_EXCLUDE_PATTERNScovers the right set of pathological/proc,/dev, and/sysentries, and thefnmatchcheck againststripped_pathnamehappens before anystat()or content access. Combining these withself.exclude_patternsmeans they are enforced even wheninclude_system_dirs=True, while still honouring user excludes elsewhere. This is a good balance between safety and configurability.Also applies to: 580-593
450-498: Always computing file_type_info aligns TreeWalker with the new reporting model
_process_dirand_process_filenow always calldetect_file_type(magic vs guess controlled byoptions.use_magic_file_type), which ensures everyFsEntryhasfile_type_infofilled in for use byFsDiffRecord(file_type/file_type_descand category). This slightly increases per-entry work but keeps magic usage fully opt-in and uses lightweight name-based guessing otherwise, so it should be acceptable for large walks.Also applies to: 621-643
snapm/command.py (1)
2698-2721: CLI option now correctly targetsuse_magic_file_typeThe
-f/--file-typesoption forsnapm snapset diffanddiffreportnow properly populatesuse_magic_file_type, aligning with theDiffOptionsfield andFileTypeDetector.detect_file_typemethod. The migration is complete with no remaining references to the previous field name.snapm/fsdiff/options.py (1)
31-45: DiffOptions flag rename is clean and complete; no stale references detectedThe
use_magic_file_typefield integrates cleanly withfrom_cmd_args, which pulls values by dataclass field names. Verification confirms the argparse destination at line 2718 ofsnapm/command.pycorrectly usesdest="use_magic_file_type", matching the dataclass field exactly. No remaining references to the oldinclude_file_typename exist anywhere in the codebase.
122d66e to
bb3661c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
snapm/fsdiff/filetypes.py (1)
647-656: Clarify operator precedence in is_text_like expression.The expression mixing
andandoroperators lacks explicit parentheses, making the intended precedence unclear:self.is_text_like = category in (...) or ( category == FileTypeCategory.DOCUMENT and mime_type.startswith("text/") or mime_type in self.TEXT_DOCUMENTS )Due to operator precedence,
andbinds tighter thanor, so the inner expression is parsed as:(category == FileTypeCategory.DOCUMENT and mime_type.startswith("text/")) or (mime_type in self.TEXT_DOCUMENTS)This means any file with a MIME type in
TEXT_DOCUMENTSwill be marked text-like regardless of its category. If the intent is that TEXT_DOCUMENTS should only apply to DOCUMENT category files, add explicit parentheses.🔎 Proposed fix if TEXT_DOCUMENTS should be category-agnostic
self.is_text_like = category in ( FileTypeCategory.TEXT, FileTypeCategory.CONFIG, FileTypeCategory.LOG, FileTypeCategory.SOURCE_CODE, -) or ( - category == FileTypeCategory.DOCUMENT - and mime_type.startswith("text/") - or mime_type in self.TEXT_DOCUMENTS -) +) or (category == FileTypeCategory.DOCUMENT and mime_type.startswith("text/")) \ + or mime_type in self.TEXT_DOCUMENTSOr if TEXT_DOCUMENTS should only apply to DOCUMENT category:
self.is_text_like = category in ( FileTypeCategory.TEXT, FileTypeCategory.CONFIG, FileTypeCategory.LOG, FileTypeCategory.SOURCE_CODE, ) or ( category == FileTypeCategory.DOCUMENT - and mime_type.startswith("text/") - or mime_type in self.TEXT_DOCUMENTS + and (mime_type.startswith("text/") or mime_type in self.TEXT_DOCUMENTS) )
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
scripts/difftest.py(1 hunks)snapm/command.py(1 hunks)snapm/fsdiff/engine.py(5 hunks)snapm/fsdiff/filetypes.py(7 hunks)snapm/fsdiff/options.py(1 hunks)snapm/fsdiff/treewalk.py(5 hunks)tests/fsdiff/test_engine.py(1 hunks)tests/fsdiff/test_filetypes.py(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/fsdiff/test_engine.py
- snapm/fsdiff/treewalk.py
- scripts/difftest.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
📚 Learning: 2025-12-02T16:08:02.588Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 0
File: :0-0
Timestamp: 2025-12-02T16:08:02.588Z
Learning: In the snapshotmanager/snapm codebase, `FileTypeInfo` in `snapm/_fsdiff/filetypes.py` is not suitable for conversion to `dataclass` because it requires custom `__init__` logic to compute `is_text_like` from the `category` parameter, and has a custom `__str__` method that formats field names as human-readable strings (e.g., "MIME type:"). The project uses `dataclass` extensively for classes like `SnapmConfig` and `GcPolicyParams*`, but only where it doesn't interfere with custom initialization or formatting requirements.
Applied to files:
snapm/fsdiff/options.pysnapm/fsdiff/filetypes.pysnapm/fsdiff/engine.pytests/fsdiff/test_filetypes.py
📚 Learning: 2025-10-22T20:06:54.993Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 541
File: snapm/manager/_mounts.py:174-197
Timestamp: 2025-10-22T20:06:54.993Z
Learning: For snapshotmanager/snapm follow-up requests from bmr-cymru, create a GitHub issue with label "enhancement", set/ensure a "type: Task" label, and assign it to bmr-cymru. Include backlinks to the PR and the discussion comment in the issue body.
Applied to files:
snapm/fsdiff/filetypes.py
📚 Learning: 2025-12-14T12:52:14.459Z
Learnt from: bmr-cymru
Repo: snapshotmanager/snapm PR: 707
File: snapm/fsdiff/engine.py:327-405
Timestamp: 2025-12-14T12:52:14.459Z
Learning: In snapm/fsdiff/engine.py, render_unified_diff intentionally uses tc.WHITE instead of tc.NORMAL to reset colors after diff lines. This avoids breaking less -R output when piping through files, and is intended to support --color=always with less -R. When reviewing changes in this file, ensure any color reset logic preserves compatibility with downstream pagers/filters; if you introduce color resets, test with --color=always and piping to less -R. If you modify reset color behavior, consider updating tests and documenting rationale.
Applied to files:
snapm/fsdiff/engine.py
🧬 Code graph analysis (2)
snapm/fsdiff/engine.py (1)
snapm/fsdiff/treewalk.py (2)
is_dir(310-318)is_symlink(321-329)
tests/fsdiff/test_filetypes.py (1)
snapm/fsdiff/filetypes.py (1)
detect_file_type(803-836)
🪛 Ruff (0.14.8)
snapm/fsdiff/filetypes.py
653-654: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear
Parenthesize the and subexpression
(RUF021)
727-727: Dictionary key literal "text/x-tex" repeated
Remove repeated key literal "text/x-tex"
(F601)
761-761: Dictionary key literal "application/x-sh" repeated
Remove repeated key literal "application/x-sh"
(F601)
762-762: Dictionary key literal "application/x-sh" repeated
Remove repeated key literal "application/x-sh"
(F601)
767-767: Dictionary key literal "application/x-csh" repeated
Remove repeated key literal "application/x-csh"
(F601)
769-769: Dictionary key literal "application/x-bat" repeated
Remove repeated key literal "application/x-bat"
(F601)
771-771: Dictionary key literal "text/x-powershell" repeated
Remove repeated key literal "text/x-powershell"
(F601)
772-772: Dictionary key literal "text/x-powershell" repeated
Remove repeated key literal "text/x-powershell"
(F601)
776-776: Dictionary key literal "text/x-perl" repeated
Remove repeated key literal "text/x-perl"
(F601)
777-777: Dictionary key literal "text/x-perl" repeated
Remove repeated key literal "text/x-perl"
(F601)
781-781: Dictionary key literal "text/x-awk" repeated
Remove repeated key literal "text/x-awk"
(F601)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (33)
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:centos-stream-10-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: virt_tests (bios, lvm, fedora42)
- GitHub Check: virt_tests (uefi, lvm, centos-stream10)
- GitHub Check: virt_tests (bios, lvm, centos-stream10)
- GitHub Check: virt_tests (bios, lvm, centos-stream9)
- GitHub Check: virt_tests (uefi, lvm-thin, fedora42)
- GitHub Check: virt_tests (uefi, lvm, centos-stream9)
- GitHub Check: virt_tests (bios, lvm-thin, fedora41)
- GitHub Check: virt_tests (bios, lvm-thin, centos-stream9)
- GitHub Check: virt_tests (bios, lvm-thin, centos-stream10)
- GitHub Check: virt_tests (bios, lvm, fedora41)
- GitHub Check: virt_tests (bios, lvm-thin, fedora42)
- GitHub Check: virt_tests (uefi, lvm-thin, centos-stream10)
- GitHub Check: virt_tests (uefi, lvm-thin, centos-stream9)
- GitHub Check: virt_tests (uefi, lvm, fedora42)
- GitHub Check: virt_tests (uefi, lvm, fedora41)
🔇 Additional comments (9)
snapm/command.py (1)
2718-2718: LGTM - CLI argument destination renamed correctly.The destination rename from
include_file_typetouse_magic_file_typeaccurately reflects that libmagic usage is now optional, whilst preserving the existing-f/--file-typesCLI flags.snapm/fsdiff/options.py (1)
44-44: LGTM - DiffOptions field renamed consistently.The field rename from
include_file_typetouse_magic_file_typealigns with the CLI changes and accurately describes the optional nature of libmagic-based detection.tests/fsdiff/test_filetypes.py (1)
34-34: LGTM - Test updates align with new API.The addition of
use_magic=Trueto thedetect_file_type()calls correctly exercises the libmagic-based detection path, ensuring tests validate the magic-enabled workflow.Also applies to: 47-47, 75-75
snapm/fsdiff/engine.py (2)
86-86: LGTM - file_type_desc field integration looks solid.The new
file_type_descfield and_get_file_type_desc()method provide user-friendly file type descriptions. The implementation correctly handles directories, symlinks, and falls back to magic/guessed descriptions, with consistent integration across__str__(),to_dict(), and reporting outputs.Also applies to: 168-168, 199-199, 261-277
661-677: LGTM - short() format updated consistently.The short output now includes
diff_type,file_type, andfile_type_descfields, and correctly uses thecontent_diff_summarykey name (aligned with the field name rather than a generic "change_summary").snapm/fsdiff/filetypes.py (4)
32-253: LGTM - comprehensive file type mappings.The extensive TEXT_EXTENSION_MAP, TEXT_FILENAME_MAP, BINARY_EXTENSION_MAP, BINARY_FILENAME_MAP, and BINARY_FILE_PATHS provide robust coverage for best-effort file type detection. The use of glob patterns in filename maps (e.g.,
"*makefile") is appropriate and will be correctly handled byPath.match().Also applies to: 255-279, 281-294, 296-449, 451-462, 464-482
485-518: LGTM - guessing functions well structured.The helper functions
_generic_guess_file(),_guess_text_file(),_guess_binary_file(), and_guess_file()provide a sensible fallback hierarchy. The use ofPath.match()for glob patterns is correct, and the logic in_guess_binary_file()to honour text patterns even in binary-heavy directories (line 559) is a good design choice.Also applies to: 521-534, 537-563, 566-584
604-606: LGTM - FileTypeCategory extensions.The addition of SOURCE_CODE, CERTIFICATE, and SYMLINK categories, along with TEXT_DOCUMENTS classification and systemd unit file handling, appropriately extends the categorisation system.
Also applies to: 619-622, 860-861
803-836: LGTM - use_magic parameter integration.The updated
detect_file_type()signature withuse_magic=Falseparameter correctly routes to either magic-based detection or the new_guess_file_type()fallback, fulfilling the PR objectives for optional libmagic usage.Also applies to: 870-884
Resolves: #806 Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
…etector Resolves: #803 Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Resolves: #807 Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Resolves: #808 Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Resolves: #804 Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Resolves: #809 Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
bb3661c to
e4ee2bf
Compare
If we don't have anyFileTypeInfo for an entry the current old_entry/new_entry we fall back to TextContextDiffer: 611 try: 612 differ = ( 613 self.get_differ_for_file(file_type_info) 614 if file_type_info is not None 615 else TextContentDiffer() 616 ) 617 return differ.generate_diff(old_path, new_path, old_entry, new_entry) This is the root cause of the memory pressure problems on low-memory systems that prompted #786, #789, #790, #798, and #800. These are all good-to-haves but the reason we were generating 3GiB RSS on these 4GiB systems was that without -f / --file-type we attempted to generate text content diffs for everything that changed - hundreds of xz compressed kernel modules for e.g. (since an add counts as modified and we diff against /dev/null): Saving cache: 38% [==========------------------] (Saving record 17525) /usr/lib/modules/6.17.12-300.fc43.x86_64/kernel/drivers/infiniband/core/ib_core.ko.xz diff_type: unified old_content: '' new_content: '�7zXZ\x00\x00\x01i"�6\x04�Ч\x0f...' diff_data: <1923 items> summary: 0 deletions, 1923 additions has_changes: True error_message: Saving cache: 38% [==========------------------] (Saving record 17526) /usr/lib/modules/6.17.12-300.fc43.x86_64/kernel/drivers/infiniband/core/ib_umad.ko.xz diff_type: unified old_content: '' new_content: '�7zXZ\x00\x00\x01i"�6\x04���...' diff_data: <349 items> summary: 0 deletions, 349 additions has_changes: True error_message: Saving cache: 38% [==========------------------] (Saving record 17527) /usr/lib/modules/6.17.12-300.fc43.x86_64/kernel/drivers/infiniband/core/ib_uverbs.ko.xz diff_type: unified old_content: '' new_content: '�7zXZ\x00\x00\x01i"�6\x04���...' diff_data: <948 items> summary: 0 deletions, 948 additions has_changes: True error_message: Since we now have magic-less best effort type guessing (#801 / #803) this should be unreachable dead code, but for belts-and-braces let's switch to the BinaryContentDiffer as the fallback default: it is safer since it never attempts to compute actual diffs. Resolves: #819 Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Implement a best-effort file type guesser when include_file_type is disabled and add "hard exclude" paths that are always skipped even if include_system_dirs=True (e.g. /proc/kmsg). Also extend the MIME-type based categorization to cover more types and add new
FileTypeCategoryenum variants.Resolves: #801
Resolves: #802
Resolves: #803
Resolves: #804
Resolves: #805
Resolves: #806
Resolves: #807
Resolves: #808
Resolves: #809
Resolves: #819
Summary by CodeRabbit
New Features
Improvements
Chores
Tests
✏️ Tip: You can customize this high-level summary in your review settings.