shutil.copytree doesn't seem to understand NTFS junctions, can truncate files when tricked #104046

Open
LazyDodo opened this issue May 1, 2023 · 2 comments
Labels
OS-windows stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

LazyDodo commented May 1, 2023

Bug report

Minimal repro extracted from Blender Issue #99766. When copying a folder that contains a directory junction, some odd behavior can be observed.

When copying a folder with symlinks=True, the junction is lost, even though the documentation states: "If symlinks is true, symbolic links in the source tree are represented as symbolic links in the new tree and the metadata of the original links will be copied as far as the platform allows". Junctions vs. symlinks can lead to a passionate debate which, for the sake of simplicity, I'd like to skip; instead, compare the behavior to the built-in copy tools in Windows. xcopy also loses the junction, so while it's inconvenient, it's not completely unexpected behavior, and shutil.copytree gets a pass here in my book.

However, shutil.copytree has some destructive behavior that xcopy does not: copying two folders on top of each other when both already contain a junction point pointing to the same folder.

With a directory structure like this:

my-app/
├─ source/
│  ├─ Hello.py
├─ Project1/
│  ├─ test[Junction to my-app/source]/
│    ├─ Hello.py
├─ Project2/
│  ├─ test[Junction to my-app/source]/
│    ├─ Hello.py

When calling shutil.copytree to copy Project1 on top of Project2, it will try to create a new file Project2/test/Hello.py, which truncates the original source/Hello.py. That is a bit more destructive than I would have liked.

Repro:

REM repro.cmd
set WORK_DIR=%~dp0
REM Create some important files
mkdir source
echo print("world") > source/hello.py

REM Create a project with a junction to the source folder
mkdir project1
mklink /J "%WORK_DIR%/project1/test" "%WORK_DIR%/source"

REM Create a second project with a junction to the source folder
mkdir project2
mklink /J "%WORK_DIR%/project2/test" "%WORK_DIR%/source"

dir /s > before.txt

REM Case 1: Copy project1 to a whole new folder.
REM This will copy the files, but will lose the junction even though we asked to copy symlinks.
REM After the copy, project3/test/hello.py and source/hello.py will be individual files; changes to one do not affect the other.
python.exe -c "import shutil; shutil.copytree(r'%WORK_DIR%project1',r'%WORK_DIR%project3', dirs_exist_ok=True, symlinks=True)"
dir /s > case1.txt

REM Case 2: Copy 2 folders with identical junctions already in place on top of each other.
REM This is much MUCH more dangerous, as copytree appears to create a new file for project2/test/hello.py,
REM which truncates the original source/hello.py to 0 bytes.
python.exe -c "import shutil; shutil.copytree(r'%WORK_DIR%project1',r'%WORK_DIR%project2', dirs_exist_ok=True, symlinks=True)"
dir /s > case2.txt

While the example is squarely in the "seems far-fetched, why would you even do that!?" category, sadly this was extracted from a real-life scenario where people did lose their work (Blender Issue #99766).

Your environment

  • CPython versions tested on: 3.10/3.11
  • Operating system and architecture: Windows 10/X64
@LazyDodo LazyDodo added the type-bug An unexpected behavior, bug, or error label May 1, 2023
@arhadthedev arhadthedev added OS-windows stdlib Python modules in the Lib dir labels May 1, 2023

eryksun commented May 1, 2023

A junction is like a Unix bind mount point. So, for comparison, let's take a look at the behavior of shutil.copytree() with a bind mount point on Linux.

#!/usr/bin/bash
# repro.sh
# Create some important files
mkdir source
echo 'print("world")' > source/hello.py

# Create a project with a bind mount to the source folder
mkdir -p project1/test
mount --bind source project1/test

# Create a second project with a bind mount to the source folder
mkdir -p project2/test
mount --bind source project2/test

ls -1shR > before.txt

# Case 1: Copy project1 to a whole new folder.
# This will copy the files, but will lose the mount point even though we asked
# to copy symlinks. After the copy project3/test/hello.py and source/hello.py
# will be individual files; changes to one do not affect the other.
python -c 'import shutil; shutil.copytree("project1", "project3", dirs_exist_ok=True, symlinks=True)'
ls -1shR > case1.txt

# Case 2: Copy 2 folders with identical bind mounts already in place on top of
# each other. This fails because the _samefile(src, dst) call in
# shutil.copyfile() is true for "hello.py".
python -c 'import shutil; shutil.copytree("project1", "project2", dirs_exist_ok=True, symlinks=True)'
ls -1shR > case2.txt

umount project1/test
umount project2/test

As expected for case 1, shutil.copytree() copies a new "hello.py" file, as shown by the differing inode numbers.

# stat -c %i source/hello.py
22807104
# stat -c %i project3/test/hello.py
22807167

For case 2, on the other hand, trying to copy the "project1" bind mount onto the "project2" bind mount fails because "hello.py" is the same file.

shutil.Error: ['<', 'D', 'i', 'r', 'E', 'n', 't', 'r', 'y', ' ', "'", 'h', 'e', 'l', 'l', 'o', '.', 'p', 'y', "'", '>', ' ', 'a', 'n', 'd', ' ', "'", 'p', 'r', 'o', 'j', 'e', 'c', 't', '2', '/', 't', 'e', 's', 't', '/', 'h', 'e', 'l', 'l', 'o', '.', 'p', 'y', "'", ' ', 'a', 'r', 'e', ' ', 't', 'h', 'e', ' ', 's', 'a', 'm', 'e', ' ', 'f', 'i', 'l', 'e']
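
The message above is split into single characters, which is the shape a list takes when it is extended with a string. A minimal sketch of that effect, assuming (this is not shown in the thread) that copytree() gathers nested errors with something like errors.extend(err.args[0]) and that the same-file error carries a plain string; the message text here is a hypothetical stand-in:

```python
# Hypothetical stand-in for the text carried by the same-file error.
message = "'hello.py' and 'project2/test/hello.py' are the same file"

errors = []
# list.extend() iterates over its argument, so a string contributes
# one list element per character instead of a single entry.
errors.extend(message)

# Appending keeps the message as one element, for comparison.
appended = []
appended.append(message)

print(len(errors), len(appended))
```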

Aside from the bad formatting of the error message, the interesting part is that it's comparing a source path that's an os.DirEntry instance with a target path that's a string. Let's check the implementation of shutil._samefile():

cpython/Lib/shutil.py

Lines 204 to 216 in d448fcb

def _samefile(src, dst):
    # Macintosh, Unix.
    if isinstance(src, os.DirEntry) and hasattr(os.path, 'samestat'):
        try:
            return os.path.samestat(src.stat(), os.stat(dst))
        except OSError:
            return False

    if hasattr(os.path, 'samefile'):
        try:
            return os.path.samefile(src, dst)
        except OSError:
            return False

On Windows, the values of st_dev and st_ino from the result of os.DirEntry.stat() are both 0 [1], unless it's a symlink. On the other hand, for os.stat() the value of st_dev is the volume serial number, and the value of st_ino is the file ID [2]. Thus even though it's the same file, os.path.samestat(src.stat(), os.stat(dst)) returns false.
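
The mismatch can be inspected directly. The following is a small sketch, not part of the original report: compare_stat_ids is a hypothetical helper that lists the (st_dev, st_ino) pair from os.DirEntry.stat() next to the pair from os.stat() for each entry in a directory. On Linux the two pairs would be expected to match; on the Windows versions discussed here, the first pair would be expected to be zeros for regular files.

```python
import os
import tempfile


def compare_stat_ids(directory):
    """Return (name, scandir_ids, stat_ids) for each entry in directory.

    Each *_ids value is a (st_dev, st_ino) pair. Hypothetical helper.
    """
    rows = []
    with os.scandir(directory) as entries:
        for entry in entries:
            cheap = entry.stat()        # may be filled from the directory listing
            full = os.stat(entry.path)  # queries the file itself
            rows.append((entry.name,
                         (cheap.st_dev, cheap.st_ino),
                         (full.st_dev, full.st_ino)))
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "hello.py"), "w") as f:
            f.write('print("world")\n')
        for name, cheap_ids, full_ids in compare_stat_ids(tmp):
            print(name, cheap_ids, full_ids, "match:", cheap_ids == full_ids)
```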

On Windows, shutil._samefile() should prefer os.path.samefile(). This is still a flawed comparison because it relies on the st_dev and st_ino stat values, which may both be 0 on Windows (e.g. a WebDAV filesystem), but at least there's an open issue to implement a more rigorous ntpath.samefile() that compares the final NT path name when the st_dev or st_ino value is unreliable.
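
As a rough illustration of that suggestion (a sketch only, not the actual or proposed patch), a variant could skip the os.DirEntry fast path on Windows and fall through to os.path.samefile(). The name samefile_sketch and the os.name check are assumptions made for this example; the final path-name fallback is likewise an illustrative choice.

```python
import os


def samefile_sketch(src, dst):
    """Hypothetical variant of shutil._samefile(); not the real fix."""
    # Assumption: on Windows, DirEntry.stat() may not carry usable
    # st_dev/st_ino values, so the fast path is skipped there.
    use_fast_path = os.name != "nt"
    if (use_fast_path and isinstance(src, os.DirEntry)
            and hasattr(os.path, "samestat")):
        try:
            return os.path.samestat(src.stat(), os.stat(dst))
        except OSError:
            return False

    if hasattr(os.path, "samefile"):
        try:
            # os.path.samefile() accepts path-like objects such as DirEntry.
            return os.path.samefile(src, dst)
        except OSError:
            return False

    # Fallback for platforms without samefile: compare normalized paths.
    return (os.path.normcase(os.path.abspath(src)) ==
            os.path.normcase(os.path.abspath(dst)))
```

This still inherits the limitation described above: where st_dev and st_ino are both 0, os.path.samefile() is not a reliable comparison either.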

Footnotes

  1. It's possible to redesign os.scandir() on Windows to get st_dev and st_ino cheaply in the os.DirEntry.stat() result. Instead of using FindFirstFileW() and FindNextFileW(), directly open the directory and call GetVolumeInformationByHandleW() to get the volume serial number, which only has to be called once. For the directory listing itself, use the same handle to call GetFileInformationByHandleEx() to get the FileIdBothDirectoryInfo, in batches with a buffer size of 64 KiB, which provides the basic stat info and file ID for each entry.

  2. The volume serial number and file ID may be 0 if the filesystem doesn't support them. However, they're commonly supported, except by some filesystem redirectors such as WebDAV.


eryksun commented May 1, 2023

Note that if the Windows batch script is changed to use mklink /D to create directory symbolic links, and the shutil.copytree() call is changed to use symlinks=False, then in this case "source\hello.py" also gets truncated. It's due to the same design flaw in shutil._samefile().
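
Until the comparison is fixed, a caller could in principle guard against this by passing a custom copy_function to shutil.copytree(). The sketch below is an assumption-laden workaround, not something proposed in this thread: guarded_copy2 and guarded_copytree are hypothetical names, and it relies on copytree() handing a plain path string to a custom copy function, so that os.path.samefile() stats both paths itself.

```python
import os
import shutil


def guarded_copy2(src, dst):
    """Hypothetical copy_function: refuse to overwrite a file with itself."""
    # os.path.samefile() calls os.stat() on both paths, so it does not
    # depend on the values cached in an os.DirEntry.
    if os.path.exists(dst) and os.path.samefile(src, dst):
        # Raising OSError is an illustrative choice; copytree() is expected
        # to collect it and raise shutil.Error once the walk finishes.
        raise OSError(f"{src!r} and {dst!r} are the same file")
    return shutil.copy2(src, dst)


def guarded_copytree(src, dst, **kwargs):
    """Call shutil.copytree() with the guarded copy function."""
    return shutil.copytree(src, dst, copy_function=guarded_copy2, **kwargs)
```

For the repro above, the call would look like guarded_copytree(project1, project2, dirs_exist_ok=True, symlinks=True). The guard has the same weakness as os.path.samefile() on filesystems that report 0 for both st_dev and st_ino.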
