# 609. Find Duplicate File in System


## Topic Alignment
- MLE Connection: Detecting duplicate artifacts mirrors deduplicating feature files or experiment outputs.
- Hash Table Role: Use file content as the key and append full paths to group duplicates.
- Interview Angle: Tests whether you can parse structured strings and build an inverted index efficiently.


## Metadata Summary
- Source: https://leetcode.com/problems/find-duplicate-file-in-system/
- Tags: Hash Table, String, File System
- Difficulty: Medium
- Recommended Review Priority: Medium


## Problem Statement
You are given a list of directory info strings. Each string has the form "root/dir file1(content1) file2(content2) ...". Construct a list of groups where each group contains the full paths of files that share identical content. Paths should be constructed as "directory_path/file_name". Only groups with more than one path should be returned.


## Progressive Hints
- Hint 1: Parse each directory string into a root path plus name-content pairs.
- Hint 2: Use file content as the hash key and append the full path to that list.
- Hint 3: After processing all entries, return only the lists where the length is at least two.


## Solution Overview
Parse each directory description, extracting file names and their contents. Build a dictionary keyed by content with values being the list of full file paths. Filter to groups of size greater than one and return them.


## Detailed Explanation
1. For each input string, split on spaces to separate the directory root from the file entries.
2. For each file entry, locate the '(' to split the file name from the content, and remove the trailing ')'.
3. Build the full path using `root + '/' + file_name` and append it to `content_map[content]`.
4. After ingesting all data, iterate over the dictionary values and select only those lists with at least two entries.

This inverted index keyed by content groups duplicates in linear time over the total characters parsed.
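As a rough illustration of steps 1 to 3, the sketch below parses a single directory string into (full path, content) pairs. The helper name `parse_entry` and the sample string are illustrative only and are not part of the reference solution; it uses `str.partition` instead of `index`, which is one of several reasonable ways to split on the first `(`.

```python
from typing import List, Tuple


def parse_entry(entry: str) -> List[Tuple[str, str]]:
    """Split one directory string into (full_path, content) pairs."""
    # Step 1: the first token is the directory root, the rest are file entries.
    root, *file_infos = entry.split()
    pairs = []
    for file_info in file_infos:
        # Step 2: everything before '(' is the file name; the rest is "content)".
        name, _, rest = file_info.partition("(")
        content = rest[:-1]  # drop the trailing ')'
        # Step 3: build the full path that would be appended to content_map[content].
        pairs.append((f"{root}/{name}", content))
    return pairs


print(parse_entry("root/a 1.txt(abcd) 2.txt(efgh)"))
# [('root/a/1.txt', 'abcd'), ('root/a/2.txt', 'efgh')]
```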


## Complexity Trade-off Table
| Approach | Time Complexity | Space Complexity |
| --- | --- | --- |
| Compare every file pair | O(m^2 * L) | O(1) extra (excluding output) |
| Hash content to paths | O(m * L) | O(m * L) |
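For contrast with the hash-based row, here is a minimal sketch of the pairwise comparison in the first row. It assumes the input has already been parsed into (path, content) pairs; the function name `find_duplicates_pairwise` is hypothetical. Each content comparison can cost up to O(L), and every file may be compared against every other, which is where the O(m^2 * L) bound comes from.

```python
from typing import List, Tuple


def find_duplicates_pairwise(files: List[Tuple[str, str]]) -> List[List[str]]:
    """Group paths with equal content by comparing files pair by pair."""
    groups = []
    for i, (path_i, content_i) in enumerate(files):
        # Skip files whose content already started a group at an earlier index.
        if any(files[k][1] == content_i for k in range(i)):
            continue
        group = [path_i]
        for j in range(i + 1, len(files)):
            # Each string comparison may scan up to L characters.
            if files[j][1] == content_i:
                group.append(files[j][0])
        if len(group) > 1:
            groups.append(group)
    return groups


sample = [
    ("root/a/1.txt", "abcd"),
    ("root/a/2.txt", "efgh"),
    ("root/c/3.txt", "abcd"),
    ("root/5.txt", "zzz"),
]
print(find_duplicates_pairwise(sample))
# [['root/a/1.txt', 'root/c/3.txt']]
```

Apart from the output list, the sketch keeps only loop indices, so it trades time for space, while the hash-based approach does the reverse.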


## Reference Implementation


```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Inverted index: file content -> full paths of files with that content.
        content_map: DefaultDict[str, List[str]] = defaultdict(list)

        for entry in paths:
            parts = entry.split()
            root = parts[0]  # first token is the directory path
            for file_info in parts[1:]:
                name, content = self._parse_file(file_info)
                full_path = f"{root}/{name}"
                content_map[content].append(full_path)

        # Keep only contents shared by at least two files.
        return [group for group in content_map.values() if len(group) > 1]

    def _parse_file(self, file_info: str) -> Tuple[str, str]:
        # file_info looks like "name(content)".
        open_idx = file_info.index('(')
        name = file_info[:open_idx]
        content = file_info[open_idx + 1:-1]  # strip '(' and the trailing ')'
        return name, content
```


## Complexity Analysis
- Time Complexity: O(m * L) where m is the number of files and L is the average length of each file description.
- Space Complexity: O(m * L) to store content strings and associated paths.
- Bottlenecks: Parsing long directory strings dominates but remains linear overall.


## Edge Cases & Pitfalls
- No group should be returned if a content appears only once.
- A directory whose files all have unique content contributes no paths to the output.
- Content strings can be very long; slice each one out of its entry once and avoid re-scanning or re-copying it afterwards.


## Follow-up Variants
- Detect duplicates across machines by hashing file contents rather than storing the entire content.
- Handle empty files or binary data encoded in hexadecimal.
- Return a representative file for each group to facilitate deduplication.
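The first follow-up can be sketched by keying the map on a fixed-size digest instead of the content itself. This is a minimal sketch under stated assumptions: the function name `group_by_digest`, the (path, bytes) input shape, and the choice of SHA-256 are all illustrative, and a real system would likely read files in chunks rather than hold each one in memory.

```python
import hashlib
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple


def group_by_digest(files: Iterable[Tuple[str, bytes]]) -> List[List[str]]:
    """Group paths by the SHA-256 digest of their content."""
    digest_map: DefaultDict[str, List[str]] = defaultdict(list)
    for path, content in files:
        # Only the 64-character hex digest is stored as the key, not the content.
        digest = hashlib.sha256(content).hexdigest()
        digest_map[digest].append(path)
    return [group for group in digest_map.values() if len(group) > 1]


sample = [
    ("hostA:/data/x.bin", b"abcd"),
    ("hostB:/data/y.bin", b"abcd"),
    ("hostB:/data/empty", b""),
]
print(group_by_digest(sample))
# [['hostA:/data/x.bin', 'hostB:/data/y.bin']]
```

Empty files need no special handling here, since the digest of empty input is well defined. Matching digests indicate equal content only with very high probability, so a byte-by-byte comparison within each group can confirm true duplicates, and the first path in each group can serve as the representative file.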


## Takeaways
- Inverted indexes are a natural fit for grouping by shared attributes.
- Parsing structured strings carefully avoids off-by-one errors and overhead.
- Filtering after grouping keeps the main logic simple and efficient.


## Similar Problems
| Problem ID | Problem Title | Technique |
| --- | --- | --- |
| 49 | Group Anagrams | Signature grouping |
| 652 | Find Duplicate Subtrees | Structural hashing |
| 187 | Repeated DNA Sequences | Hash by content |
