lfs: internal `_filter_paths()` function is prohibitively slow

The internal helper function [`_filter_paths()`](https://github.com/iterative/scmrepo/blob/fb99c42816f3c58a7d379092479d72338c3c2cac/src/scmrepo/git/lfs/fetch.py#L121-L133) in the `lfs` subpackage is prohibitively slow at the moment.

For instance, when I host the (rather small) MNIST dataset (70,000 image files) using Git LFS and import it using `dvc import`, then [`_filter_paths()` is effectively called](https://github.com/iterative/scmrepo/blob/fb99c42816f3c58a7d379092479d72338c3c2cac/src/scmrepo/git/lfs/fetch.py#L108) like this:

```python
paths = ["data/train/0/1.png", "data/train/0/2.png", ...]  # list of all 70,000 image files

for path in _filter_paths(paths, include=paths, exclude=None):
    # yield LFS pointer data
```

So, in this simple case, the current implementation has quadratic complexity in the number of LFS files and materializes generators unnecessarily, which is already significant with only 70,000 files – imagine large datasets.

I've done a quick benchmark to quantify the current implementation performance and compare it against two alternatives:

* Current implementation:

    ```python
    import timeit

    setup = """\
    import fnmatch
    from typing import Iterable, Iterator, Optional

    import fsspec


    def _filter_paths(
        paths: Iterable[str], include: Optional[list[str]], exclude: Optional[list[str]]
    ) -> Iterator[str]:
        filtered = set()
        if include:
            for pattern in include:
                filtered.update(fnmatch.filter(paths, pattern))
        else:
            filtered.update(paths)
        if exclude:
            for pattern in exclude:
                filtered.difference_update(fnmatch.filter(paths, pattern))
        yield from filtered


    fs = fsspec.filesystem("file")
    paths = fs.find("data/")
    """


    timeit.timeit("list(_filter_paths(paths, include=paths, exclude=None))", setup=setup, number=1)
    # >>> 632.4136102720004 seconds
    ```

* A more efficient implementation with identical semantics by manually unionizing the filename regex patterns and matching each path against the pre-compiled single regex:

    ```python
    import timeit

    setup = """\
    import fnmatch
    import re
    from typing import Iterable, Iterator, Optional

    import fsspec


    def _filter_paths(
        paths: Iterable[str], include: Optional[list[str]], exclude: Optional[list[str]]
    ) -> Iterator[str]:
        if include:
            include_match = re.compile(r"|".join(map(fnmatch.translate, include))).match
            paths = (path for path in paths if include_match(path) is not None)
        if exclude:
            exclude_match = re.compile(r"|".join(map(fnmatch.translate, exclude))).match
            paths = (path for path in paths if exclude_match(path) is None)
        yield from paths


    fs = fsspec.filesystem("file")
    paths = fs.find("data/")
    """


    timeit.timeit("list(_filter_paths(paths, include=paths, exclude=None))", setup=setup, number=1)
    # >>> 118.97085808499833 seconds
    ```

* Assuming we actually don't need Unix filename pattern matching (`fnmatch`) – it [doesn't seem to be needed _at the moment_ by DVC](https://github.com/iterative/dvc/blob/7df58ce404d682e7fd8518374648428d171ca159/dvc/scm.py#L280-L285) –, an implementation using set lookups (O(1) complexity) would be _orders of magnitude_ faster:

    ```python
    import timeit

    setup = """\
    import fnmatch
    import re
    from typing import Iterable, Iterator, Optional

    import fsspec


    def _filter_paths(
        paths: Iterable[str], include: Optional[list[str]], exclude: Optional[list[str]]
    ) -> Iterator[str]:
        if include:
            include_set = set(include)
            paths = (path for path in paths if path in include_set)
        if exclude:
            exclude_set = set(exclude)
            paths = (path for path in paths if path not in exclude_set)
        yield from paths


    fs = fsspec.filesystem("file")
    paths = fs.find("data/")
    """


    timeit.timeit("list(_filter_paths(paths, include=paths, exclude=None))", setup=setup, number=1)
    # >>> 0.00669500499861897 seconds
    ```

WDYT about the two alternative implementations? Is Unix filename pattern matching really needed?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

lfs: internal `_filter_paths()` function is prohibitively slow #338

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

lfs: internal _filter_paths() function is prohibitively slow #338

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

lfs: internal `_filter_paths()` function is prohibitively slow #338