Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shutil.rmtree is inefficient due to listdir() instead of scandir() #72750

Closed
enkore mannequin opened this issue Oct 30, 2016 · 9 comments
Closed

shutil.rmtree is inefficient due to listdir() instead of scandir() #72750

enkore mannequin opened this issue Oct 30, 2016 · 9 comments
Labels
3.7 stdlib type-feature

Comments

@enkore
Copy link
Mannequin

@enkore enkore mannequin commented Oct 30, 2016

BPO 28564
Nosy @pitrou, @nh2, @serhiy-storchaka, @MojoVampire, @zhangyangyu, @enkore
PRs
  • #4085
  • Dependencies
  • bpo-25996: Add support of file descriptor in os.scandir()
  • Files
  • shutil-rmtree-scandir.patch
  • bench_rmtree.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-11-04.12:46:52.211>
    created_at = <Date 2016-10-30.20:02:39.210>
    labels = ['3.7', 'type-feature', 'library']
    title = 'shutil.rmtree is inefficient due to listdir() instead of scandir()'
    updated_at = <Date 2017-12-30.05:15:42.120>
    user = 'https://github.com/enkore'

    bugs.python.org fields:

    activity = <Date 2017-12-30.05:15:42.120>
    actor = 'nh2'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-11-04.12:46:52.211>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2016-10-30.20:02:39.210>
    creator = 'enkore'
    dependencies = ['25996']
    files = ['45382', '45619']
    hgrepos = []
    issue_num = 28564
    keywords = ['patch']
    message_count = 9.0
    messages = ['279745', '279801', '279813', '280213', '281618', '305000', '305001', '305556', '309219']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'nh2', 'serhiy.storchaka', 'josh.r', 'xiang.zhang', 'enkore']
    pr_nums = ['4085']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue28564'
    versions = ['Python 3.7']

    @enkore
    Copy link
    Mannequin Author

    @enkore enkore mannequin commented Oct 30, 2016

    The use of os.listdir severely limits the speed of this function on
    anything except solid-state drives.

    Using the new-in-Python 3.5 os.scandir should eliminate this
    bottleneck.

    @enkore enkore mannequin added the stdlib label Oct 30, 2016
    @serhiy-storchaka serhiy-storchaka added 3.7 type-feature labels Oct 30, 2016
    @MojoVampire
    Copy link
    Mannequin

    @MojoVampire MojoVampire mannequin commented Oct 31, 2016

    You need to cache the names up front because the loop is unlinking entries, and readdir isn't consistent when the directory entries are mutated between calls. emscripten-core/emscripten#2528

    FindFirstFile/FindNextFile likely has similar issues, even if they're not consistently seen (due to DeleteFile itself not guaranteeing deletion until the last handle to the file is closed).

    scandir might save some stat calls, but you'd need to convert it from generator to list before the loop begins, which would limit the savings a bit.

    @enkore
    Copy link
    Mannequin Author

    @enkore enkore mannequin commented Oct 31, 2016

    The main issue on *nix is more likely that by using listdir you get directory order, while what you really need is inode ordering. scandir allows for that, since you get the inode from the DirEntry with no extra syscalls - especially without an open() or stat().

    Other optimizations are also possible. For example opening the directory and using unlinkat() would likely shave off a bit of CPU. But the dominating factor here is likely the bad access pattern.

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Nov 7, 2016

    Proposed patch implements shutil.rmtree using os.scandir. Needed file descriptors support in os.scandir (bpo-25996). I did not test how this affects the performance of shutil.rmtree.

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Nov 24, 2016

    Benchmarks show about 20% speed up.

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Oct 25, 2017

    Following Antoine's suggestion the patch now makes shutil.rmtree() using os.scandir() on all platforms.

    I doubt about one thing. This patch changes os.listdir passed to the onerror handler to os.scandir. This can break a user code that checks if the first argument in onerror is os.listdir. If keep this change, it should be documented in the "Porting to 3.7" section. Alternatively, we can continue passing os.listdir if os.scandir() failed despites the fact that os.listdir no longer used.

    @pitrou
    Copy link
    Member

    @pitrou pitrou commented Oct 25, 2017

    I think we should change to os.scandir. No need to accumulate compatibility baggage like that.

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Nov 4, 2017

    New changeset d4d79bc by Serhiy Storchaka in branch 'master':
    bpo-28564: Use os.scandir() in shutil.rmtree(). (bpo-4085)
    d4d79bc

    @nh2
    Copy link
    Mannequin

    @nh2 nh2 mannequin commented Dec 30, 2017

    I've filed https://bugs.python.org/issue32453, which is about O(n^2) deletion behaviour for large directories.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 stdlib type-feature
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants