Daemon to monitor changes in real-time and run restic #1502

Open
alphapapa opened this issue Dec 24, 2017 · 31 comments

@alphapapa

alphapapa commented Dec 24, 2017

Having used CrashPlan and Obnam for a while, I'm very impressed by Restic. When it gains compression (and optionally disabled encryption), I think it will be a great alternative (for the software, not the backend service).

One of the nice things about CrashPlan is its real-time service that watches for changes to backup sets (using e.g. inotify on Linux) and runs the actual backup every so often, as configured.

And one of the reasons I'm so impressed with Restic is how fast its actual backup phase is. When backing up a large set of files with few of them having changed, most of the time is spent in the scan phase (e.g. #1160).

So, since Restic is so flexible about how it receives the list of paths to back up, it would be nice if there were a daemon that could run in the background and watch for changes to certain paths, then run Restic every so often on the files/directories that have changed. That could make Restic really fast and relatively lightweight for desktop backups.

Since Restic is so flexible in this regard, I don't know if it would even be necessary for such a daemon to be specific to Restic. If not, there might already be some solutions available, in which case we don't need to do anything, but documenting it would be very helpful. :)

What do you think?

Thanks for your work on Restic!

@rawtaz
Contributor

rawtaz commented Dec 24, 2017

This is something I always thought about as well (I too used CrashPlan briefly), but honestly I haven't felt the need with restic since, as you say, it's very fast.

fd0 added the "type: feature suggestion" label Dec 25, 2017
@alphapapa
Author

alphapapa commented Dec 27, 2017

Thinking about this a bit more, I think it does need support in restic, because restic needs to use the same set of paths for each backup run without actually scanning them, so that the new snapshot will have the same paths and be considered the next snapshot of the same paths. It needs to do this while receiving the list of actual files to back up from another source, so it doesn't have to scan all the paths and find the modified files. Then it needs to merge that list of files with the existing snapshot. In other words, the end result (the snapshot) needs to be the same as if restic had run normally.

So maybe restic needs to:

  1. Copy previous snapshot.
  2. Add/replace/remove paths given by daemon to the new snapshot.
  3. Save new snapshot.

The example scenario for why this would be useful is one in which there are thousands of directories in a backup set, but since the last snapshot, only a few files have been added, changed, or deleted. In this case, it takes restic much longer to scan all of the directories to find the changes than it does to actually back up the changes. A daemon could monitor the directories in real time with, e.g., inotify, and tell restic exactly what has changed. Restic then "just" needs to merge those changes with a copy of the previous snapshot.
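
As a rough illustration of the copy/merge/save steps above, here is a minimal Python sketch. It is not how restic stores snapshots: the flat `path -> content ID` mapping, the `Change` tuple, and the function name `merge_changes` are all hypothetical simplifications chosen only to show that the merged result is a complete snapshot, including deletions.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical model: a snapshot is a flat mapping from path to a content ID.
Snapshot = Dict[str, str]
# A change is (path, new_content_id); None as the ID means the path was deleted.
Change = Tuple[str, Optional[str]]


def merge_changes(previous: Snapshot, changes: Iterable[Change]) -> Snapshot:
    """Return a new, complete snapshot: the previous state plus the reported changes."""
    merged = dict(previous)               # 1. copy the previous snapshot
    for path, content_id in changes:      # 2. add/replace/remove the reported paths
        if content_id is None:
            merged.pop(path, None)        # deleted since the last snapshot
        else:
            merged[path] = content_id     # new or modified
    return merged                         # 3. the caller would save this as the new snapshot


if __name__ == "__main__":
    prev = {"/home/u/a.txt": "id1", "/home/u/b.txt": "id2"}
    reported = [("/home/u/b.txt", "id3"), ("/home/u/a.txt", None), ("/home/u/c.txt", "id4")]
    print(merge_changes(prev, reported))
```

The previous snapshot is left untouched, so every snapshot stays a full point-in-time view even though only the reported paths were examined.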

This could make restic the "killer" backup app. I've been using restic to back up most of my homedir for a few days now in a cron job, and it's working well, but since not much data actually changes each time, the job as a whole takes a long time because of the scanning. If it could work like I've described here, the whole backup run could take only a few seconds.

@rawtaz
Contributor

rawtaz commented Dec 27, 2017

@alphapapa Why can't you just have it create a snapshot with whatever files and directories in it that were changed (in a consolidated format of course)? For example, every time it runs, it could be done as if you had manually asked it to back up those top-level items that changed.

Yes, there will be many snapshots with many different files and directory names, but so what? They will just tell you exactly what was backed up (except when some parts of it have been consolidated, e.g. when all files in a directory have changed it should of course only list that directory). I don't really see why you should try to hide the fact that only this and that item was backed up.

If it works like this, then this could even be done completely outside of restic: a separate process could use the filesystem notifier, build the list of items to back up, and when appropriate ask restic to back them up. It could be smart, e.g. if X% of the files in a folder were changed, then instead of listing all those files to the restic backup command, it would list the directory they're in and use excludes to not back up the non-changed ones. It could also use e.g. --files-from along with stdin (assuming supported by restic) like --files-from - instead of listing target files/directories on the command line when spawning restic.
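
The "X% of a folder changed" consolidation could look roughly like the following Python sketch. The function name `consolidate`, the `files_per_dir` mapping (directory to total file count, which the watcher process would have to maintain), and the 50% default threshold are assumptions for illustration. The sketch also skips the exclude step, so a consolidated directory would be handed to restic in full.

```python
import os
from collections import Counter
from typing import Dict, Iterable, List


def consolidate(changed_files: Iterable[str], files_per_dir: Dict[str, int],
                threshold: float = 0.5) -> List[str]:
    """Replace files by their parent directory when enough of that directory changed."""
    changed = sorted(set(changed_files))
    changed_per_dir = Counter(os.path.dirname(p) for p in changed)
    targets = set()
    for path in changed:
        parent = os.path.dirname(path)
        total = files_per_dir.get(parent, 0)
        if total and changed_per_dir[parent] / total >= threshold:
            targets.add(parent)   # most of the directory changed: list it once
        else:
            targets.add(path)     # only a few files changed: list them individually
    return sorted(targets)


if __name__ == "__main__":
    print(consolidate(["/d/a", "/d/b", "/e/x"], {"/d": 2, "/e": 10}))
```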

EDIT: If you want to somehow have a way to identify your backups, then having a tag added to the snapshots might be a better idea than to have them identified by the contents of the snapshots.

@alphapapa
Author

alphapapa commented Dec 28, 2017

@rawtaz If you did that, you would suffer data loss when forgetting and pruning snapshots:

  1. Snapshots are grouped by path.
  2. You can manually group snapshots by tag instead.
  3. If you do this, you will lose data when you forget snapshot B which contains files from directory Z which are not present in snapshots A and C.
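
The data-loss scenario in point 3 can be shown with plain sets. This Python sketch models each partial snapshot as nothing more than the set of paths it contains, which is a simplification (real snapshots reference content, not only names), and the function name `files_lost_by_forgetting` is made up for the example.

```python
from typing import Dict, Set


def files_lost_by_forgetting(snapshots: Dict[str, Set[str]], forgotten: str) -> Set[str]:
    """Paths that exist only in the forgotten snapshot and in no remaining one."""
    remaining: Set[str] = set()
    for name, files in snapshots.items():
        if name != forgotten:
            remaining |= files
    return snapshots[forgotten] - remaining


if __name__ == "__main__":
    partial = {
        "A": {"/x/report.txt"},
        "B": {"/x/report.txt", "/z/photo.jpg"},  # only B ever saw directory /z
        "C": {"/x/report.txt"},
    }
    print(files_lost_by_forgetting(partial, "B"))
```

With complete snapshots the same retention step is harmless, because every later snapshot still lists the unchanged files.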

This is the same issue Obnam has: it allows you to back up different sets of files to a single repository, but when applying retention policy, it does not group snapshots at all, so you will lose data when forgetting a generation that has files not present in other generations.

Besides that, you would make it virtually impossible to do a full restore of a set of directories. You would have to restore every snapshot in the repo, in order, to ensure that you get all files. Then, you would not be able to determine whether a file had been deleted or was merely unchanged from earlier snapshots and omitted from later snapshots, so it would be impossible to restore a directory's state at a certain date and time. IOW, snapshots would no longer be snapshots, but essentially tarballs of arbitrary sets of files.

In summary, no, that wouldn't work at all.

@rawtaz
Contributor

rawtaz commented Dec 28, 2017

It works if you combine it with "full" runs regularly, and/or tags to control which snapshots you forget. But sure, I see what you're saying.

@alphapapa
Author

Now you're talking about moving from a snapshot model to a full/incremental backup model, which is taking us back to the previous generation of backup software. If that's what you're after, you can already have that with, e.g. Duplicity.

@rawtaz
Contributor

rawtaz commented Dec 28, 2017

I really am not. But whatever you say.

@rawtaz
Contributor

rawtaz commented Dec 28, 2017

Perhaps an alternative solution to reach the goal you are seeking is to implement some type of --include and/or --include-file (that supports stdin) in restic, that makes it only process the files in the include. That way an external process dealing with filesystem notifications can run restic and tell it which files to process (out of those in the files/folders you list for the snapshot).

@alphapapa
Author

That's what I proposed in the first message. The issue is that restic must merge the "included" files (including removing deleted ones) with the previous snapshot. Each snapshot must be complete or it's no longer a snapshot.

@arikb

arikb commented Apr 17, 2018

This is amazing. I've started a thread in https://forum.restic.net/t/continuous-backup/593/13 starting from the same idea as @alphapapa - to create a separate tool driving restic - and the more I thought about it the more I realised it has to be a change to the way new snapshots are created, for exactly the same rationale as here. I'd definitely want to see that happen.

Thank you for your work on restic!

@fd0 fd0 self-assigned this May 28, 2018
@tim-seoss

Linux Storage, Filesystem, and Memory-Management Summit talk regarding kernel support for this kind of feature (also offers significant savings for incremental backups of large, but mostly-unchanging data sets):

https://lwn.net/Articles/755277/

@fd0 fd0 removed their assignment Jul 4, 2018
@whereisaaron

Since looking at restic I have had the same thoughts as @alphapapa and @arikb

My performance problem with restic is not the actual backing up of files or dedup or (lack of) compression, but the fact it spends 99% of its time pointlessly scanning hundreds of thousands of files to notice they haven't changed since an hour ago 😄 That's a lot of wasted effort and I/O that continuous solutions like CrashPlan and Carbonite avoid.

restic supports backing up a list of specific files (--files-from). So my thought was to have an inotify wrapper/daemon that would build a list of modified files for 15-60 minutes, and then dispatch a restic backup with the accumulated file path list. Then once every 12-24 hours it would run restic with a full scan to ensure nothing is missed (inotify and NTFS events are not guaranteed; they use kernel memory and just get dropped if not picked up in time).

A continuous wrapper would reduce hourly restic backup times from ~30 minutes runtime to ~3 minutes, and greatly reduce the hourly I/O operations on the filesystem.
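
The scheduling half of such a wrapper might be sketched in Python as below. The class name `ChangeBatcher`, its parameters, and the injected `clock` are hypothetical. The actual event source (inotify or similar) is left out, and the sketch only builds the `restic backup` command lines rather than running them. It also assumes recorded paths still exist when the list is written, and the resulting list-based snapshots contain only the listed files.

```python
import time
from typing import Callable, List, Optional, Set


class ChangeBatcher:
    """Accumulates changed paths and decides when to run which kind of backup."""

    def __init__(self, roots: List[str], flush_interval: float = 15 * 60,
                 full_interval: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.roots = roots
        self.flush_interval = flush_interval
        self.full_interval = full_interval
        self.clock = clock
        self.pending: Set[str] = set()
        self.last_flush = self.last_full = clock()

    def record(self, path: str) -> None:
        """Called by the watcher for every change event it receives."""
        self.pending.add(path)

    def next_command(self, list_file: str) -> Optional[List[str]]:
        """Return the restic command that is due now, or None if nothing is due."""
        now = self.clock()
        if now - self.last_full >= self.full_interval:
            # Periodic full scan catches events that may have been dropped.
            self.pending.clear()
            self.last_full = self.last_flush = now
            return ["restic", "backup", *self.roots]
        if self.pending and now - self.last_flush >= self.flush_interval:
            # Write the accumulated paths, one per line, for --files-from.
            with open(list_file, "w", encoding="utf-8") as fh:
                fh.write("\n".join(sorted(self.pending)) + "\n")
            self.pending.clear()
            self.last_flush = now
            return ["restic", "backup", "--files-from", list_file]
        return None
```

A driver loop would call `record()` from the event handler and poll `next_command()` periodically, passing any returned list to something like `subprocess.run`.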

Has this been done already? Anyone seen a generic inotify wrapper that could do this or be adapted to do this?

I guess a shell script with inotifywait could do it for local filesystems.

@OmenWild

Have you looked at lsyncd for the continuous monitoring by any chance? lsyncd is designed to watch the filesystem for changes, then call a sync process (it supports direct copy, rsync, and ssh+rsync, and supports custom backends).

I've been using it to make near-live copies of critical data to off site locations for years. The more I think about it, the more I think this could be what gets me off CrashPlan.

@whereisaaron

@OmenWild yep, took a look at that. It looks like it would work. However, there are two huge flies in my inotify soup.

  1. It was pointed out to me that the file list of each restic snapshot will just be the files accumulated in the inotify list. When you come to restore you'll have all these partial file lists, rather than point-in-time snapshots.

  2. Network filesystems like NFS have no inotify/fsevents-like service, so you have to be able to run e.g. lsyncd directly on the NFS servers, which is sometimes not possible, e.g. for AWS EFS.

@anarcat
Contributor

anarcat commented Nov 23, 2020

What would be even better would be a "continuous backup" strategy, where backups are not made of specific "snapshots" necessarily but just consist of a continuous stream of all changes done to the disk. Very few backup programs work this way (for obvious performance reasons), but some manage to do this meaningfully and it's pretty powerful.

I thought this would be basically impossible forever on Linux, but an LWN article about Changed-block tracking and differential backups in QEMU changed my mind: what they do is they use the network block device (NBD) protocol to track changes (remotely) and do magic things on top of that.

I'm just brainstorming here (and I'm coming from a longer conversation on the same topic in borg, see borgbackup/borg#325) so obviously there's no clear path there from here. But since it's the first time I have the slightest clue of how this could work, I figured I would share it with you folks as well. :)

@tim-seoss

@anarcat I don't think this sort of VM-specific block-level backup is useful for what Restic does.

Restic would require one or more APIs which allow user space to be notified of changes to filesystem entries.

It might be worth one of the Restic developers asking on the Linux-Fsdevel mailing list about this, since it looks like the fanotify infrastructure is currently under active development by @amir73il and others.

@tim-seoss

This issue, fsnotify/fsnotify#114, provides a summary of the position as of 1 year ago, i.e. this appears to be possible using fanotify, with the caveat that events are only available for an entire filesystem, and the filtering (e.g. if only part of the filesystem was backed up) would need to be carried out in user space. The project linked is a Go library, and so might also be relevant to this project.
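
The user-space filtering mentioned here is straightforward once events arrive as absolute paths. A minimal Python sketch follows, assuming the event source (not shown) delivers absolute paths for the whole filesystem. The function name `under_roots` is hypothetical.

```python
import os
from typing import Iterable, List


def under_roots(event_paths: Iterable[str], roots: Iterable[str]) -> List[str]:
    """Keep only event paths that are a backup root or lie below one.

    All paths are assumed to be absolute; os.path.commonpath raises
    ValueError when absolute and relative paths are mixed.
    """
    norm_roots = [os.path.normpath(r) for r in roots]
    kept = []
    for path in event_paths:
        p = os.path.normpath(path)
        for root in norm_roots:
            # Compare whole path components, so /home/user2 is not
            # mistaken for something below /home/u.
            if os.path.commonpath([p, root]) == root:
                kept.append(p)
                break
    return kept


if __name__ == "__main__":
    events = ["/home/u/a.txt", "/home/user2/b", "/var/log/x", "/home/u"]
    print(under_roots(events, ["/home/u/"]))
```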

@amir73il

@tim-seoss your understanding of what fanotify provides in kernel v5.9 seems correct.

I should also point out https://facebook.github.io/watchman/ as a service that collects fs notifications (on all the platforms that Restic supports) for subscribers to read the list of changed files periodically.

@tim-seoss

@amir73il Thanks for your comments! At the time of writing, the Linux manual page for fanotify says:
> Currently, only a limited set of events is supported. In particular, there is no support for create, delete, and move events.

Source: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man7/fanotify.7#n32

Is the man page lagging behind improvements in the kernel's functionality? Is the man page the preferred place to look for fanotify documentation?

@rawtaz
Contributor

rawtaz commented Nov 24, 2020

There's a lot of talk about detecting file changes. That's probably the easiest part, what's arguably more important is how this would work with the way restic is currently designed on the repository side of things.

@amir73il

@tim-seoss This sentence in the man page is stale. Thanks for pointing it out. I will send a patch to remove it.
Otherwise, the man page is up to date. See FAN_CREATE, FAN_DELETE, FAN_MOVE and the fanotify_fid.c example.

@anarcat
Contributor

anarcat commented Nov 24, 2020

> I should also point out https://facebook.github.io/watchman/ as a service that collects fs notification (on all the platforms that Restic supports) for subscribers to read the list of changed files periodically.

There are about half a billion (give or take a billion ;) such programs, and they mostly all rely on the same kernel primitives. I counted 14 here and I keep finding new ones...

that said:

> There's a lot of talk about detecting file changes. That's probably the easiest part, what's arguably more important is how this would work with the way restic is currently designed on the repository side of things.

That. The problem is not waking up on changes: you could probably find some obscure Linux API to get a ping with every write on the filesystem. The problem is streaming that meaningfully into a backup system, which is why I raised the question of "continuous backups".

If that's out of scope for this issue, someone just needs to come up with a quick tutorial on how to use one of the aforementioned tools (inotify, watchman, whatever) and call restic on every change. But that's not continuous backups, and could be very hard to scale in restic considering how slow purges are with large numbers of snapshots...

@tim-seoss

tim-seoss commented Nov 24, 2020

The minimum useful functionality (which should also be reasonably easy to implement, and is useful for large seldom-changed data sets) would be a daemon that only triggered a restic backup if changes had been made to the filesystem.

This would be a useful addition, and wouldn't require any changes to restic.

Without knowing the restic internals in detail, it seems like a possible next step would be to add an option to restic to enable it to accept a list (e.g. via an inherited fd) of files which have changed (i.e. skip the normal filesystem tree walk, and use the supplied list of paths instead).

@rawtaz
Contributor

rawtaz commented Nov 25, 2020

@tim-seoss You can already do this outside of restic. Use some tool that records relevant file changes. Have it build a list of files to back up and run restic with the --files-from - or the next-release --files-from-verbatim - or --files-from-raw - equivalents. There's no need to put this into restic, especially considering how much platforms and operating systems differ on the detection part.
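
Feeding a change list to restic on stdin could be sketched in Python as follows. It assumes `--files-from-raw -` reads NUL-terminated paths from standard input, which is how I understand the option, so check it against the restic documentation for the version in use. It also assumes the repository and password are already configured through the environment. The function names are hypothetical.

```python
import os
import subprocess
from typing import Iterable


def encode_raw_list(paths: Iterable[str]) -> bytes:
    """Encode paths as NUL-terminated entries (assumed --files-from-raw format)."""
    out = bytearray()
    for path in paths:
        raw = os.fsencode(path)
        if b"\0" in raw:
            raise ValueError(f"path contains a NUL byte: {path!r}")
        out += raw + b"\0"
    return bytes(out)


def backup_changed(paths: Iterable[str], restic: str = "restic") -> int:
    """Run `restic backup` with the change list on stdin and return its exit code."""
    proc = subprocess.run(
        [restic, "backup", "--files-from-raw", "-"],
        input=encode_raw_list(paths),
    )
    return proc.returncode
```

NUL termination avoids any ambiguity with file names that contain spaces or newlines, which a line-based list cannot represent safely.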

@tim-seoss

For a long time this functionality was blocked on missing kernel features (at least on Linux). This is no longer the case.

In the 3 years that this bug has been open, both Restic and external tools have gained functionality.

A reasonable resolution to this bug may now be to document how to combine external software with Restic (and, if appropriate, to create wrapper scripts etc.).

@whereisaaron

> @tim-seoss You can already do this outside of restic. Use some tool that records relevant file changes. Have it build a list of files to back up and run restic with the --files-from - or the next-release --files-from-verbatim - or --files-from-raw - equivalents. There's no need to put this into restic, especially considering how much platforms and operating systems differ on the detection part.

I originally thought these options might work, but they don't work on restore. The snapshot is then only for the files listed, so you can't restore using these snapshots, since the file list doesn't include the unchanged files from the previous snapshot?

@haslersn

In order to implement this as a wrapper around restic, restic probably needs the feature to manually pass a list of files that has been changed relative to the parent snapshot. When using this feature of restic backup, it's probably a good idea to make the explicit --parent flag mandatory, to make sure the wrapper software and restic agree on the parent snapshot.
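
On the wrapper side, the "agree on the parent snapshot" check might look like this Python sketch. The names `ParentMismatch` and `backup_args` are hypothetical, as is the idea that the wrapper stores the full snapshot ID its change journal is relative to. Only the existing `--parent` flag of `restic backup` is used here. The proposed changed-files option is deliberately left out because it is not part of released restic.

```python
from typing import List


class ParentMismatch(Exception):
    """The change journal is relative to a different snapshot than the repository's latest."""


def backup_args(journal_parent: str, repo_latest: str, roots: List[str]) -> List[str]:
    """Build `restic backup` arguments, refusing to proceed if the parents disagree.

    journal_parent: full ID of the snapshot the recorded changes are relative to.
    repo_latest:    full ID of the snapshot the wrapper currently sees as latest.
    """
    if journal_parent != repo_latest:
        # Some other backup ran in between; the journal no longer describes
        # the difference to the latest snapshot, so a full scan is needed.
        raise ParentMismatch(f"journal is based on {journal_parent}, latest is {repo_latest}")
    return ["restic", "backup", "--parent", journal_parent, *roots]
```

On a mismatch the wrapper would fall back to a normal full-scan backup and start a fresh journal from the resulting snapshot.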

@haslersn

haslersn commented Sep 15, 2023

> restic probably needs the feature to manually pass a list of files that has been changed relative to the parent snapshot. When using this feature of restic backup, it's probably a good idea to make the explicit --parent flag mandatory, to make sure the wrapper software and restic agree on the parent snapshot.

@aawsome any plans to implement this in rustic?

@haslersn

haslersn commented Sep 15, 2023

When working with zfs/btrfs snapshots, a list of files that was changed could also be obtained (instead of using inotify) using tools that report the diff between two snapshots.
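
For ZFS, turning a snapshot diff into a change list could be sketched in Python as below. The input format is an assumption: tab-separated lines as I understand `zfs diff -H` to print them, with a change type (`M`, `+`, `-`, `R`) in the first field and, for renames, the old and new path in the second and third fields. Verify this against the `zfs diff` manual for your platform. The function name `parse_zfs_diff` is hypothetical, and the sketch does not decode the octal escapes `zfs diff` may use for unusual characters.

```python
from typing import Set, Tuple


def parse_zfs_diff(output: str) -> Tuple[Set[str], Set[str]]:
    """Split tab-separated diff output into (changed_or_new, deleted) path sets."""
    changed: Set[str] = set()
    deleted: Set[str] = set()
    for line in output.split("\n"):
        fields = line.split("\t")
        if len(fields) < 2:
            continue                      # empty or malformed line
        kind = fields[0]
        if kind == "-":
            deleted.add(fields[1])
        elif kind == "R" and len(fields) >= 3:
            deleted.add(fields[1])        # old name disappears
            changed.add(fields[2])        # new name must be backed up
        elif kind in ("M", "+"):
            changed.add(fields[1])
    return changed, deleted


if __name__ == "__main__":
    sample = "M\t/tank/home/u\n+\t/tank/home/u/new.txt\n-\t/tank/home/u/old.txt\n"
    print(parse_zfs_diff(sample))
```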

@amir73il

Not sure this is relevant, but FYI (inotify-tools/inotify-tools#134): fsnotifywatch can now be used as a standalone watch program.

haslersn added a commit to haslersn/restic that referenced this issue Sep 16, 2023
For `restic backup`, support new flags
`--changed-files-from-verbatim` and `--changed-files-from-raw` to
read the files/dirs that actually have changed from a file (or
multiple files). Directories that don't (directly or indirectly)
contain any changed files/dirs will reuse the corresponding subtree
of the parent snapshot.

This option is useful for higher-level backup tools which use
restic as a backend but have their own mechanism of figuring out
which files have changed (e.g., using zfs or btrfs diff tools).
Currently, `--parent` must be passed explicitly as a protection
mechanism, to make sure the higher-level backup tool and restic
agree on the parent snapshot. The caller can still circumvent
this protection mechanism by passing `--parent latest`.

Caveat: since device IDs are unstable (across reboots or across
different zfs/btrfs snapshots of the same subvolume), the parent
snapshot and current snapshot might have mismatching device IDs.
In this case, the feature will still reuse subtrees of the parent
snapshot (under the conditions mentioned above), so we end up with
a snapshot that contains subtrees with different `device_id`
values, even if there was only a single mountpoint in play.

For now, we could simply document this caveat and discourage users
who rely on correct restoration of hardlinks from using this
feature. When restic#3041 is
properly fixed in the future, then this caveat probably goes
away, too.

The idea for this feature emerged here:
restic#1502 (comment)
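
The subtree-reuse rule in this commit message ("directories that don't directly or indirectly contain any changed files/dirs reuse the parent's subtree") can be illustrated with a small Python sketch. This is not the code from the commit: the function names `dirty_paths` and `can_reuse_subtree` are hypothetical, and paths are treated as plain POSIX strings.

```python
import posixpath
from typing import Iterable, Set


def dirty_paths(changed_paths: Iterable[str], root: str) -> Set[str]:
    """Every changed path below root, plus all of its ancestors up to root."""
    root = posixpath.normpath(root)
    prefix = root.rstrip("/") + "/"
    dirty: Set[str] = set()
    for path in changed_paths:
        current = posixpath.normpath(path)
        # Walk upwards until we reach the root or leave the backed-up tree.
        while current == root or current.startswith(prefix):
            dirty.add(current)
            if current == root:
                break
            current = posixpath.dirname(current)
    return dirty


def can_reuse_subtree(directory: str, dirty: Set[str]) -> bool:
    """A directory's subtree can be taken from the parent snapshot if nothing in it changed."""
    return posixpath.normpath(directory) not in dirty


if __name__ == "__main__":
    dirty = dirty_paths(["/home/u/docs/a.txt"], "/home/u")
    print(can_reuse_subtree("/home/u/music", dirty))  # nothing changed below it
    print(can_reuse_subtree("/home/u/docs", dirty))   # contains a changed file
```

Only the directories in the dirty set need to be read from disk. Everything else is carried over unchanged, which is where the time saving comes from.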
@haslersn

@alphapapa check if my implementation in #4469 suits your needs.
