Skip to content

lfs: internal _filter_paths() function is prohibitively slow #338

@sisp

Description

@sisp

The internal helper function _filter_paths() in the lfs subpackage is prohibitively slow at the moment.

For instance, when I host the (rather small) MNIST dataset (70,000 image files) using Git LFS and import it using dvc import, then _filter_paths() is effectively called like this:

paths = ["data/train/0/1.png", "data/train/0/2.png", ...]  # list of all 70,000 image files

for path in _filter_paths(paths, include=paths, exclude=None):
    # yield LFS pointer data

So, in this simple case, the current implementation has quadratic complexity in the number of LFS files and materializes generators unnecessarily, which is already significant with only 70,000 files – imagine large datasets.

I've done a quick benchmark to quantify the current implementation performance and compare it against two alternatives:

  • Current implementation:

    import timeit
    
    setup = """\
    import fnmatch
    from typing import Iterable, Iterator, Optional
    
    import fsspec
    
    
    def _filter_paths(
        paths: Iterable[str], include: Optional[list[str]], exclude: Optional[list[str]]
    ) -> Iterator[str]:
        filtered = set()
        if include:
            for pattern in include:
                filtered.update(fnmatch.filter(paths, pattern))
        else:
            filtered.update(paths)
        if exclude:
            for pattern in exclude:
                filtered.difference_update(fnmatch.filter(paths, pattern))
        yield from filtered
    
    
    fs = fsspec.filesystem("file")
    paths = fs.find("data/")
    """
    
    
    timeit.timeit("list(_filter_paths(paths, include=paths, exclude=None))", setup=setup, number=1)
    # >>> 632.4136102720004 seconds
  • A more efficient implementation with identical semantics by manually unionizing the filename regex patterns and matching each path against the pre-compiled single regex:

    import timeit
    
    setup = """\
    import fnmatch
    import re
    from typing import Iterable, Iterator, Optional
    
    import fsspec
    
    
    def _filter_paths(
        paths: Iterable[str], include: Optional[list[str]], exclude: Optional[list[str]]
    ) -> Iterator[str]:
        if include:
            include_match = re.compile(r"|".join(map(fnmatch.translate, include))).match
            paths = (path for path in paths if include_match(path) is not None)
        if exclude:
            exclude_match = re.compile(r"|".join(map(fnmatch.translate, exclude))).match
            paths = (path for path in paths if exclude_match(path) is None)
        yield from paths
    
    
    fs = fsspec.filesystem("file")
    paths = fs.find("data/")
    """
    
    
    timeit.timeit("list(_filter_paths(paths, include=paths, exclude=None))", setup=setup, number=1)
    # >>> 118.97085808499833 seconds
  • Assuming we actually don't need Unix filename pattern matching (fnmatch) – it doesn't seem to be needed at the moment by DVC –, an implementation using set lookups (O(1) complexity) would be orders of magnitude faster:

    import timeit
    
    setup = """\
    import fnmatch
    import re
    from typing import Iterable, Iterator, Optional
    
    import fsspec
    
    
    def _filter_paths(
        paths: Iterable[str], include: Optional[list[str]], exclude: Optional[list[str]]
    ) -> Iterator[str]:
        if include:
            include_set = set(include)
            paths = (path for path in paths if path in include_set)
        if exclude:
            exclude_set = set(exclude)
            paths = (path for path in paths if path not in exclude_set)
        yield from paths
    
    
    fs = fsspec.filesystem("file")
    paths = fs.find("data/")
    """
    
    
    timeit.timeit("list(_filter_paths(paths, include=paths, exclude=None))", setup=setup, number=1)
    # >>> 0.00669500499861897 seconds

WDYT about the two alternative implementations? Is Unix filename pattern matching really needed?

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions