The internal helper function _filter_paths() in the lfs subpackage is prohibitively slow at the moment.
For instance, when I host the (rather small) MNIST dataset (70,000 image files) using Git LFS and import it using dvc import, then _filter_paths() is effectively called like this:
```python
paths = ["data/train/0/1.png", "data/train/0/2.png", ...]  # list of all 70,000 image files

for path in _filter_paths(paths, include=paths, exclude=None):
    ...  # yield LFS pointer data
```

So, in this simple case, the current implementation has quadratic complexity in the number of LFS files and materializes generators unnecessarily, which is already significant with only 70,000 files – imagine large datasets.
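To make the quadratic cost concrete, here is a minimal toy sketch (hypothetical 5-element path list, not the actual dataset): `fnmatch.filter()` rescans the entire `paths` list once per include pattern, so N patterns over N paths means N × N pattern comparisons.

```python
import fnmatch

# Toy reproduction of the pathological call pattern: include == paths.
# fnmatch.filter() walks the whole `paths` list for every single pattern,
# so N patterns over N paths costs O(N^2) comparisons.
paths = [f"data/train/0/{i}.png" for i in range(5)]

matched = set()
for pattern in paths:  # mirrors include=paths in the dvc import case
    # full scan of `paths` on every iteration
    matched.update(fnmatch.filter(paths, pattern))

print(len(matched))  # -> 5 (each literal pattern matches exactly itself)
```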
I've done a quick benchmark to quantify the current implementation's performance and compare it against two alternatives:
- Current implementation:

```python
import timeit

setup = """\
import fnmatch
from typing import Iterable, Iterator, Optional

import fsspec

def _filter_paths(
    paths: Iterable[str],
    include: Optional[list[str]],
    exclude: Optional[list[str]]
) -> Iterator[str]:
    filtered = set()
    if include:
        for pattern in include:
            filtered.update(fnmatch.filter(paths, pattern))
    else:
        filtered.update(paths)
    if exclude:
        for pattern in exclude:
            filtered.difference_update(fnmatch.filter(paths, pattern))
    yield from filtered

fs = fsspec.filesystem("file")
paths = fs.find("data/")
"""

timeit.timeit("list(_filter_paths(paths, include=paths, exclude=None))", setup=setup, number=1)
# >>> 632.4136102720004 seconds
```
- A more efficient implementation with identical semantics, obtained by joining the translated filename patterns into a single pre-compiled regex and matching each path against it:

```python
import timeit

setup = """\
import fnmatch
import re
from typing import Iterable, Iterator, Optional

import fsspec

def _filter_paths(
    paths: Iterable[str],
    include: Optional[list[str]],
    exclude: Optional[list[str]]
) -> Iterator[str]:
    if include:
        include_match = re.compile(r"|".join(map(fnmatch.translate, include))).match
        paths = (path for path in paths if include_match(path) is not None)
    if exclude:
        exclude_match = re.compile(r"|".join(map(fnmatch.translate, exclude))).match
        paths = (path for path in paths if exclude_match(path) is None)
    yield from paths

fs = fsspec.filesystem("file")
paths = fs.find("data/")
"""

timeit.timeit("list(_filter_paths(paths, include=paths, exclude=None))", setup=setup, number=1)
# >>> 118.97085808499833 seconds
```
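As a sanity check on the second variant's semantics, here is a toy comparison (hypothetical patterns and paths, not from the actual dataset) between the union regex and plain fnmatch matching. One caveat: `fnmatch.filter()` applies `os.path.normcase()` before matching, so the regex variant is exactly equivalent to the case-sensitive `fnmatch.fnmatchcase()`; on POSIX the two coincide.

```python
import fnmatch
import re

# Hypothetical toy data: check that one union regex built from
# fnmatch.translate() accepts exactly the paths fnmatch would.
patterns = ["data/train/0/*.png", "data/val/*/[0-9].png"]
paths = ["data/train/0/1.png", "data/val/a/7.png", "data/test/9.png"]

union_match = re.compile("|".join(map(fnmatch.translate, patterns))).match

via_regex = [p for p in paths if union_match(p) is not None]
via_fnmatch = [p for p in paths if any(fnmatch.fnmatchcase(p, pat) for pat in patterns)]

assert via_regex == via_fnmatch  # both: ['data/train/0/1.png', 'data/val/a/7.png']
```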
- Assuming we don't actually need Unix filename pattern matching (fnmatch) – it doesn't seem to be needed by DVC at the moment – an implementation using set lookups (O(1) per path) would be orders of magnitude faster:

```python
import timeit

setup = """\
import fnmatch
import re
from typing import Iterable, Iterator, Optional

import fsspec

def _filter_paths(
    paths: Iterable[str],
    include: Optional[list[str]],
    exclude: Optional[list[str]]
) -> Iterator[str]:
    if include:
        include_set = set(include)
        paths = (path for path in paths if path in include_set)
    if exclude:
        exclude_set = set(exclude)
        paths = (path for path in paths if path not in exclude_set)
    yield from paths

fs = fsspec.filesystem("file")
paths = fs.find("data/")
"""

timeit.timeit("list(_filter_paths(paths, include=paths, exclude=None))", setup=setup, number=1)
# >>> 0.00669500499861897 seconds
```
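On the question of whether fnmatch semantics are needed at all: the set-based variant only matches literal paths, so it silently diverges from the current behavior the moment a caller passes a glob. A toy illustration (hypothetical paths):

```python
import fnmatch

# Hypothetical example of where the set-based variant changes behaviour:
# exact-match lookups ignore wildcards entirely.
paths = ["data/train/0/1.png", "data/train/0/2.png"]
include = ["data/train/0/*.png"]  # a glob pattern, not a literal path

glob_matches = fnmatch.filter(paths, include[0])
set_matches = [p for p in paths if p in set(include)]

assert glob_matches == paths  # fnmatch: both files match the glob
assert set_matches == []      # set lookup: the glob matches nothing
```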
WDYT about the two alternative implementations? Is Unix filename pattern matching really needed?