Skip to content

collecting checksum for an external dir dependency takes too much time #2891

@shcheklein

Description

@shcheklein

DVC version:

DVC version: 0.71.0+d42a04
Python version: 3.7.5
Platform: Darwin-18.2.0-x86_64-i386-64bit
Binary: False
Package: None
Cache: reflink - True, hardlink - True, symlink - True
Filesystem type (cache directory): ('apfs', '/dev/disk1s1')
Filesystem type (workspace): ('apfs', '/dev/disk1s1')

Reproduce:

python -m cProfile -o s3-dep-images.prof -m dvc run -v -d s3://dvc-common/images -o out.txt "aws s3 ls s3://dvc-storage/get-started/fe/ >out.txt"

takes 3-4 hours to complete.

s3://dvc-common/images "Directory" has ~0.5M files.

Profile file:

https://www.dropbox.com/s/uqdjmm4pjkn0rj1/s3-dep-images.prof?dl=0

96% of time goes to thread lock.

Output:

Mon Dec  2 23:11:43 2019    s3-dep-images.prof

         325633688 function calls (322461350 primitive calls) in 10724.500 seconds

   Ordered by: internal time
   List reduced from 7431 to 100 due to restriction <100>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   596400 10371.995    0.017 10371.995    0.017 {method 'acquire' of '_thread.lock' objects}
    23894  152.922    0.006  152.922    0.006 {method 'read' of '_ssl._SSLSocket' objects}
  5782566   12.714    0.000   21.243    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:83(get_token)
  4475479    9.278    0.000   10.816    0.000 {method 'write' of '_io.TextIOWrapper' objects}
   444810    5.790    0.000   52.679    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:669(_parse)
      459    5.261    0.011    5.261    0.011 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
 16273231    5.019    0.000    8.242    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/utils.py:300(<genexpr>)
   440822    4.726    0.000    5.103    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:216(__init__)
   440796    4.490    0.000 10374.041    0.024 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py:408(result)
   134568    4.240    0.000   28.491    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/std.py:309(format_meter)
1352110/1352109    4.128    0.000    5.655    0.000 {method 'format' of 'str' objects}
   889629    3.983    0.000   13.170    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:878(_parse_numeric_token)
   297700    3.369    0.000 10376.913    0.035 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:264(wait)
   269133    3.349    0.000    4.986    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/utils.py:209(_is_ascii)
445259/463    3.343    0.000   83.457    0.180 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/botocore/parsers.py:362(_handle_structure)
 16138665    3.222    0.000    3.222    0.000 {built-in method unicodedata.east_asian_width}
   579377    2.974    0.000   12.928    0.000 {built-in method builtins.sum}
  1334439    2.732    0.000    2.732    0.000 {method 'replace' of 'datetime.datetime' objects}
  2203982    2.699    0.000    5.480    0.000 /Users/ivan/Projects/dvc/dvc/path_info.py:183(parts)
   440797    2.623    0.000 10418.901    0.024 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/std.py:1061(__iter__)
2673319/521    2.482    0.000   83.474    0.160 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/botocore/parsers.py:299(_parse_shape)
   440797    2.433    0.000 10376.699    0.024 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py:591(result_iterator)
   441675    2.391    0.000    2.621    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py:700(__str__)
  1334430    2.356    0.000    3.138    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:434(append)
  2807755    2.261    0.000    2.261    0.000 {method 'sub' of 're.Pattern' objects}
 12454653    2.224    0.000    3.369    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:214(isnum)
   881628    1.954    0.000    4.412    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py:63(parse_parts)
   440797    1.850    0.000    8.336    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py:145(submit)
  2203982    1.828    0.000    7.840    0.000 /Users/ivan/Projects/dvc/dvc/path_info.py:209(__hash__)
  5782566    1.813    0.000   23.057    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:195(__next__)
 11596386    1.809    0.000    1.811    0.000 {built-in method builtins.getattr}
        1    1.726    1.726 10433.537 10433.537 /Users/ivan/Projects/dvc/dvc/remote/base.py:180(_calculate_checksums)

Artboard 1

Metadata

Metadata

Assignees

No one assigned

    Labels

    performanceimprovement over resource / time consuming tasks

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions