
Have shutil.copytree(), copy() and copystat() use cached scandir() stat()s #77876

Closed
giampaolo opened this issue May 30, 2018 · 13 comments
Assignees
Labels
3.8 (EOL) end of life performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@giampaolo
Contributor

BPO 33695
Nosy @brettcannon, @ncoghlan, @vstinner, @giampaolo, @benjaminp, @tarekziade, @benhoyt, @serhiy-storchaka, @1st1
PRs
  • bpo-33695 shutil.copytree() + os.scandir() cache #7874
  • bpo-35652 Add use_srcentry parameter to shutil.copytree() II #11425
  • bpo-35652: shutil.copytree(copy_function=...): pass DirEntry instead of path str #11997
  • bpo-38688: Consume iterator and create list of entries to prevent infinite loop #17098
Files
  • bench.py
  • bpo-33695.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/giampaolo'
    closed_at = <Date 2018-11-12.14:19:18.809>
    created_at = <Date 2018-05-30.12:22:35.667>
    labels = ['3.8', 'library', 'performance']
    title = 'Have shutil.copytree(), copy() and copystat() use cached scandir() stat()s'
    updated_at = <Date 2019-11-09.13:25:26.220>
    user = 'https://github.com/giampaolo'

    bugs.python.org fields:

    activity = <Date 2019-11-09.13:25:26.220>
    actor = 'kinow'
    assignee = 'giampaolo.rodola'
    closed = True
    closed_date = <Date 2018-11-12.14:19:18.809>
    closer = 'giampaolo.rodola'
    components = ['Library (Lib)']
    creation = <Date 2018-05-30.12:22:35.667>
    creator = 'giampaolo.rodola'
    dependencies = []
    files = ['47624', '47625']
    hgrepos = []
    issue_num = 33695
    keywords = ['patch']
    message_count = 13.0
    messages = ['318175', '320303', '320304', '321852', '322872', '322873', '322901', '322912', '322933', '322975', '322984', '328267', '329732']
    nosy_count = 10.0
    nosy_names = ['brett.cannon', 'ncoghlan', 'vstinner', 'giampaolo.rodola', 'benjamin.peterson', 'tarek', 'stutzbach', 'benhoyt', 'serhiy.storchaka', 'yselivanov']
    pr_nums = ['7874', '11425', '11997', '17098']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue33695'
    versions = ['Python 3.8']

    @giampaolo
    Contributor Author

    The attached patch makes shutil.copytree() use os.scandir() and, unlike bpo-33414, passes DirEntry instances around so that cached stat() results are also used from within the copy2() and copystat() functions. The number of times the filesystem gets accessed via os.stat() is therefore reduced considerably. A similar improvement can be made for rmtree() (but that's for another ticket). The patch and benchmark script are attached.
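The mechanism being exploited can be sketched like this (a minimal illustration of DirEntry stat caching, not the actual patch; the helper name and directory layout are made up for the example):

```python
import os
import tempfile

def sizes_with_cached_stats(path):
    """Collect sorted (name, size) pairs, reusing each DirEntry's cached stat()."""
    results = []
    with os.scandir(path) as it:
        for entry in it:
            # The first stat() call may hit the filesystem; later calls on
            # the same DirEntry reuse the cached result. Passing the entry
            # (rather than a path string) down to helpers is what lets
            # them skip extra os.stat() calls.
            st = entry.stat(follow_symlinks=False)
            results.append((entry.name, st.st_size))
    return sorted(results)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "a.txt"), "wb") as f:
            f.write(b"hello")
        print(sizes_with_cached_stats(tmp))  # [('a.txt', 5)]
```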

    Linux (+13.5% speedup)
    ======================

    --- without patch:

        $ ./python bench.py
        Priming the system's cache...
        7956 files and dirs, repeat 1/3... min = 0.551s
        7956 files and dirs, repeat 2/3... min = 0.548s
        7956 files and dirs, repeat 3/3... min = 0.548s
        best result = 0.548s
    

    --- with patch:

        $ ./python  bench.py 
        Priming the system's cache...
        7956 files and dirs, repeat 1/3... min = 0.481s
        7956 files and dirs, repeat 2/3... min = 0.479s
        7956 files and dirs, repeat 3/3... min = 0.474s
        best result = 0.474s

    Windows (+17% speedup)
    ======================

    --- without patch:

        $ ./python bench.py
        Priming the system's cache...
        7956 files and dirs, repeat 1/3... min = 9.015s
        7956 files and dirs, repeat 2/3... min = 8.747s
        7956 files and dirs, repeat 3/3... min = 8.614s
        best result = 8.614s
    

    --- with patch:

        $ ./python  bench.py 
        Priming the system's cache...
        7956 files and dirs, repeat 1/3... min = 7.827s
        7956 files and dirs, repeat 2/3... min = 7.369s
        7956 files and dirs, repeat 3/3... min = 7.153s
        best result = 7.153s

    Windows SMB share (+30%)
    ========================

    --- without patch:

    C:\Users\user\Desktop\cpython>PCbuild\win32\python.exe bench.py
    Priming the system's cache...
    7956 files and dirs, repeat 1/3... min = 46.853s
    7956 files and dirs, repeat 2/3... min = 46.330s
    7956 files and dirs, repeat 3/3... min = 44.720s
    best result = 44.720s
    

    --- with patch:

    C:\Users\user\Desktop\cpython>PCbuild\win32\python.exe bench.py
    Priming the system's cache...
    7956 files and dirs, repeat 1/3... min = 31.729s
    7956 files and dirs, repeat 2/3... min = 30.936s
    7956 files and dirs, repeat 3/3... min = 30.936s
    best result = 30.936s
    

    Number of stat() syscalls (-38%)
    ================================

    --- without patch:

        $ strace ./python bench.py  2>&1 | grep "stat(" | wc -l
        324808
        
    --- with patch:
    
        $ strace ./python bench.py  2>&1 | grep "stat(" | wc -l
        198768

    @giampaolo giampaolo added 3.8 (EOL) end of life stdlib Python modules in the Lib dir performance Performance or resource usage labels May 30, 2018
    @giampaolo
    Contributor Author

    PR at: #7874.
    I re-ran the benchmarks since the shutil code changed after bpo-33695. Linux went from +13.5% to +8.8% and Windows went from +17% to +20.7%.
    In the PR I deliberately avoided using a context manager around os.scandir() for now so that the patch is easier to review (I will add it before pushing).

    Linux (+8.8%)
    =============

    without patch:
        $ ./python bench-copytree.py
        Priming the system's cache...
        7956 files and dirs, repeat 1/3... min = 0.604s
        7956 files and dirs, repeat 2/3... min = 0.603s
        7956 files and dirs, repeat 3/3... min = 0.601s

    with patch:
        $ ./python  bench-copytree.py 
        Priming the system's cache...
        7956 files and dirs, repeat 1/3... min = 0.557s
        7956 files and dirs, repeat 2/3... min = 0.548s
        7956 files and dirs, repeat 3/3... min = 0.548s
        best result = 0.548s

    Windows (+20.7%)
    ================

    without patch:
    C:\Users\user\Desktop>cpython\PCbuild\win32\python.exe cpython\bench-copytree.py
    Priming the system's cache...
    7956 files and dirs, repeat 1/3... min = 8.275s
    7956 files and dirs, repeat 2/3... min = 8.018s
    7956 files and dirs, repeat 3/3... min = 7.978s
    best result = 7.978s

    with patch:
    C:\Users\user\Desktop>cpython\PCbuild\win32\python.exe cpython\bench-copytree.py
    Priming the system's cache...
    7956 files and dirs, repeat 1/3... min = 6.609s
    7956 files and dirs, repeat 2/3... min = 6.609s
    7956 files and dirs, repeat 3/3... min = 6.609s
    best result = 6.609s

    @giampaolo
    Contributor Author

    I re-ran benchmarks since shutil code changed after bpo-33695.

    Sorry, I meant bpo-33671.

    @giampaolo
    Copy link
    Contributor Author

    Unless somebody has complaints I think I'm gonna merge this soon.

    @serhiy-storchaka
    Member

    I'm not convinced that this change should be merged. The benefit is small, and 1) it shows up only for an artificial set of tiny files, 2) the benchmark ignores real IO: it measures work against a warm cache. When copying real files (/usr/include or Lib/) with dropped caches, the difference is insignificant. On the other hand, this optimization makes the code more complex, and it can make the case where the ignore argument is specified slower.

    @serhiy-storchaka
    Member

    For dropping disk caches on Linux, run (as root)

        with open('/proc/sys/vm/drop_caches', 'ab') as f: f.write(b'3\n')

    before every test.

    @giampaolo
    Contributor Author

    I agree the provided benchmark on Linux should be more refined, and honestly I'm not sure whether running "echo 3 | sudo tee /proc/sys/vm/drop_caches" before it is enough.

    The main point here is the reduction of stat() syscalls (-38%), and that can make a considerable difference, especially with network filesystems. That's basically the reason why scandir() was introduced in the first place and why it is used in os.walk(), glob.glob(), and shutil.rmtree(), so I'm not sure why we should apply a different rationale to shutil.copytree().
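That rationale can be illustrated with a small comparison (a sketch, not stdlib code; the function names are made up): classifying entries via os.path.isdir() costs one stat() per entry, while DirEntry.is_dir() can usually answer from the data readdir() already returned.

```python
import os
import tempfile

def count_dirs_listdir(path):
    # One os.stat() syscall per entry, issued by os.path.isdir().
    return sum(os.path.isdir(os.path.join(path, name))
               for name in os.listdir(path))

def count_dirs_scandir(path):
    # DirEntry.is_dir() typically needs no extra syscall: on most
    # platforms the file type comes back with the directory listing.
    with os.scandir(path) as it:
        return sum(entry.is_dir() for entry in it)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "sub"))
        open(os.path.join(tmp, "f.txt"), "w").close()
        print(count_dirs_listdir(tmp), count_dirs_scandir(tmp))  # 1 1
```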

    @serhiy-storchaka
    Member

    os.walk() and glob.glob() used *only* the stat(), opendir() and readdir() syscalls (and the stat() syscalls dominated), so the effect of reducing the number of stat() syscalls is significant. shutil.rmtree() also uses the unlink() syscall. Since unlink() is usually cheap (but see bpo-32453), the benefit is still good, though not as large. Actually, I had concerns about using scandir() in shutil.rmtree().

    shutil.copytree() needs to open, read, and write files. That is not so cheap, and the benefit of reducing the number of stat() syscalls is hardly noticeable in real cases. shutil.copytree() was intentionally not converted to using scandir().

    @vstinner
    Member

    vstinner commented Aug 2, 2018

    When I worked on the os.scandir() implementation, I recall that an interesting test was NFS. Depending on the configuration, stat() in a network filesystem can be between very slow and slow.

    @giampaolo
    Contributor Author

    Yes, file copy (open() + read() + write()) is of course more expensive than just "reading" a tree (os.walk(), glob()) or deleting it (rmtree()), and the "pure file copy" time adds up in the benchmark. Indeed, it is no coincidence that bpo-33671 (which replaced read() + write() with sendfile()) shaved a 5% gain off the benchmark I initially posted for Linux.

    Still, in an 8k small-files-tree scenario we're seeing a ~9% gain on Linux, 20% on Windows, and 30% on an SMB share (localhost vs. VirtualBox). I do not consider this a "hardly noticeable gain" as you imply: it is noticeable, consistent, and measurable, even with the cache being involved (as it is).

    Note that the number of stat() syscalls per file is reduced from 6 to 1 (or more when follow_symlinks=False), and that is the real gist here. That *does* make a difference on a regular Windows filesystem and makes a huge difference with network filesystems in general, since a simple stat() call implies access to the network, not the disk.
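The gist — pass the DirEntry down so the copy helpers do not re-stat — can be sketched as follows (a hypothetical helper for illustration, not the merged patch):

```python
import os
import tempfile

def file_size(src, *, follow_symlinks=True):
    """Size of src, which may be a plain path or an os.DirEntry.

    A DirEntry reuses its cached stat() result (no extra syscall);
    a plain path forces a fresh os.stat() call.
    """
    if isinstance(src, os.DirEntry):
        return src.stat(follow_symlinks=follow_symlinks).st_size
    return os.stat(src, follow_symlinks=follow_symlinks).st_size

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        with open(path, "wb") as f:
            f.write(b"x" * 42)
        with os.scandir(tmp) as it:
            entry = next(it)
        print(file_size(entry), file_size(path))  # 42 42
```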

    @1st1
    Member

    1st1 commented Aug 2, 2018

    Depending on the configuration, stat() in a network filesystem can be between very slow and slow.

    +1. I also quickly glanced over the patch and I think it looks like a clear win.

    @giampaolo
    Contributor Author

    @serhiy: I would like to proceed with this. Do you have further comments? Do you prefer to bring this up on python-dev for further discussion?

    @giampaolo
    Contributor Author

    New changeset 19c46a4 by Giampaolo Rodola in branch 'master':
    bpo-33695 shutil.copytree() + os.scandir() cache (bpo-7874)

    @giampaolo giampaolo self-assigned this Nov 12, 2018
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022

    4 participants