Optimization of mergerfs.dup for already duplicated datasets #140

Open
exp625 opened this issue Jul 14, 2023 · 1 comment
exp625 commented Jul 14, 2023

I have observed that the mergerfs.dup command takes a significant amount of time to execute on a dataset that has already been duplicated. Currently, I have the following setup:

/mnt/disk1:/mnt/disk2 /mnt/pool

Within the /mnt/pool directory, there is a folder called /mnt/pool/data containing approximately 273GB of data with 147,148 files. My objective is to maintain duplicate copies of this folder on both drives. To achieve this, I am using the command /usr/local/bin/mergerfs.dup -d newest -c 2 -e /mnt/pool/data.

The execution of this command takes approximately 45 minutes, even when no actual copying is required. The script performs an rsync overwrite for each file.
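For context, the per-file pattern is roughly the sketch below; the exact rsync flags mergerfs.dup passes may differ, this is only meant to illustrate why roughly 147,000 no-op invocations still take 45 minutes:

import subprocess

def illustrative_copy_file(srcbase, tgtbase, relpath):
    # Rough, assumed equivalent of what build_copy_file produces:
    # the '/./' anchor makes rsync --relative recreate relpath under tgtbase.
    src = srcbase.rstrip('/') + '/./' + relpath
    tgt = tgtbase.rstrip('/') + '/'
    return ['rsync', '-a', '--relative', src, tgt]

def illustrative_execute(args):
    # One process spawn per file; even when rsync decides nothing needs to be
    # copied, spawning it ~147,000 times dominates the runtime.
    subprocess.check_call(args)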

To optimize the performance, I propose modifying the *_dupfun functions to also return whether an overwrite is necessary:

def newest_dupfun(default_basepath,relpath,basepaths):
    sts = dict([(f,os.lstat(os.path.join(f,relpath))) for f in basepaths])

    # mtimes differ: pick the newest copy as the source and request an overwrite
    mtime = sts[basepaths[0]].st_mtime
    if not all([st.st_mtime == mtime for st in sts.values()]):
        return sorted(sts,key=lambda x: sts.get(x).st_mtime,reverse=True)[0], True

    # mtimes match but ctimes differ: choose a source, but skip the overwrite
    # (rsync cannot set ctime anyway)
    ctime = sts[basepaths[0]].st_ctime
    if not all([st.st_ctime == ctime for st in sts.values()]):
        return sorted(sts,key=lambda x: sts.get(x).st_ctime,reverse=True)[0], False

    # all copies agree: no overwrite needed
    return default_basepath, False

The call to the *_dupfun functions is modified accordingly:

srcpath, overwrite = dupfun(basepath,relpath,existing)

Then, a simple check can be added to determine whether an overwrite is necessary before executing the rsync command:

for tgtpath in existing:
    if prune and i >= count:
        break
    copies.append(tgtpath)
    if overwrite:
        args = build_copy_file(srcpath,tgtpath,relpath)
        print('# overwrite')
        print_args(args)
        if execute:
            execute_cmd(args)
    i += 1

These changes significantly improved performance, reducing the script's execution time to just 1 minute. Furthermore, the output log now only shows actual changes made to the file system; previously, the rsync overwrites never actually did anything, since the files were already duplicated.

This change should improve the performance of all the *_dupfun functions except for mergerfs_dupfun, where

def mergerfs_dupfun(default_basepath,relpath,basepaths):
    return default_basepath, True

would trigger an overwrite every time, because no other check is possible.
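For illustration, the same pattern applied to one of the other selection functions could look like the sketch below (assuming, as its name suggests, that the upstream largest_dupfun picks the copy with the largest st_size):

def largest_dupfun(default_basepath,relpath,basepaths):
    sts = dict([(f,os.lstat(os.path.join(f,relpath))) for f in basepaths])

    # sizes differ: pick the largest copy as the source and request an overwrite
    size = sts[basepaths[0]].st_size
    if not all([st.st_size == size for st in sts.values()]):
        return sorted(sts,key=lambda x: sts.get(x).st_size,reverse=True)[0], True

    # all copies have the same size: keep the default source, no overwrite
    return default_basepath, False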

Please let me know if you can spot any issues. If you'd like, I can create a merge request with these changes for you to review.

@sjtuross

I found this issue while looking for a way to disable the overwrite.
