Optimization of mergerfs.dup for already duplicated datasets #140

Open
exp625 opened this issue Jul 14, 2023 · 1 comment
exp625 commented Jul 14, 2023

I have observed that the mergerfs.dup command takes a significant amount of time to execute on a dataset that has already been duplicated. Currently, I have the following setup:

/mnt/disk1:/mnt/disk2 /mnt/pool

Within the /mnt/pool directory, there is a folder called /mnt/pool/data containing approximately 273GB of data with 147,148 files. My objective is to maintain duplicate copies of this folder on both drives. To achieve this, I am using the command /usr/local/bin/mergerfs.dup -d newest -c 2 -e /mnt/pool/data.

The execution of this command takes approximately 45 minutes, even when no actual copying is required. The script performs an rsync overwrite for each file.
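For context, the per-file pattern is roughly the sketch below; the exact rsync flags mergerfs.dup passes may differ, this is only meant to illustrate why roughly 147,000 no-op invocations still take 45 minutes:

import subprocess

def illustrative_copy_file(srcbase, tgtbase, relpath):
    # Rough, assumed equivalent of what build_copy_file produces:
    # the '/./' anchor makes rsync --relative recreate relpath under tgtbase.
    src = srcbase.rstrip('/') + '/./' + relpath
    tgt = tgtbase.rstrip('/') + '/'
    return ['rsync', '-a', '--relative', src, tgt]

def illustrative_execute(args):
    # One process spawn per file; even when rsync decides nothing needs to be
    # copied, spawning it ~147,000 times dominates the runtime.
    subprocess.check_call(args)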

To optimize the performance, I propose modifying the *_dupfun functions to also return whether an overwrite is necessary:

def newest_dupfun(default_basepath,relpath,basepaths):
    sts = dict([(f,os.lstat(os.path.join(f,relpath))) for f in basepaths])

    # mtimes differ: pick the newest copy as the source and request an overwrite
    mtime = sts[basepaths[0]].st_mtime
    if not all([st.st_mtime == mtime for st in sts.values()]):
        return sorted(sts,key=lambda x: sts.get(x).st_mtime,reverse=True)[0], True

    # mtimes match but ctimes differ: choose a source, but skip the overwrite
    # (rsync cannot set ctime anyway)
    ctime = sts[basepaths[0]].st_ctime
    if not all([st.st_ctime == ctime for st in sts.values()]):
        return sorted(sts,key=lambda x: sts.get(x).st_ctime,reverse=True)[0], False

    # all copies agree: no overwrite needed
    return default_basepath, False

The call to the *_dupfun functions is modified accordingly:

srcpath, overwrite = dupfun(basepath,relpath,existing)

Then, a simple check can be added to determine whether an overwrite is necessary before executing the rsync command:

for tgtpath in existing:
    if prune and i >= count:
        break
    copies.append(tgtpath)
    if overwrite:
        args = build_copy_file(srcpath,tgtpath,relpath)
        print('# overwrite')
        print_args(args)
        if execute:
            execute_cmd(args)
    i += 1

These changes significantly improved performance, reducing the script's execution time to just 1 minute. Furthermore, the output log now only shows actual changes made to the file system; previously, the rsync overwrites never actually did anything, since the files were already duplicated.

This change should improve the performance of all the *_dupfun functions except for mergerfs_dupfun, where

def mergerfs_dupfun(default_basepath,relpath,basepaths):
    return default_basepath, True

would trigger an overwrite every time, because no other check is possible.
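For illustration, the same pattern applied to one of the other selection functions could look like the sketch below (assuming, as its name suggests, that the upstream largest_dupfun picks the copy with the largest st_size):

def largest_dupfun(default_basepath,relpath,basepaths):
    sts = dict([(f,os.lstat(os.path.join(f,relpath))) for f in basepaths])

    # sizes differ: pick the largest copy as the source and request an overwrite
    size = sts[basepaths[0]].st_size
    if not all([st.st_size == size for st in sts.values()]):
        return sorted(sts,key=lambda x: sts.get(x).st_size,reverse=True)[0], True

    # all copies have the same size: keep the default source, no overwrite
    return default_basepath, False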

Please let me know if you can spot any issues. If you'd like, I can create a merge request with these changes for you to review.

@sjtuross

I found this issue while looking for a way to disable the overwrite.
