aof rewrite and rdb save counters in info #10178

Merged: 12 commits, Feb 17, 2022

Conversation

yoav-steinberg
Contributor

@yoav-steinberg yoav-steinberg commented Jan 25, 2022

Add aof_rewrites and rdb_snapshots counters to info. This is useful to figure out if a rewrite or snapshot happened since last check. This was part of the (ongoing) effort to provide a safe backup solution for multipart-aof backups.
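
A minimal sketch of how a client might use the new counter to tell whether a rewrite happened between two checks. It assumes the counter appears as `aof_rewrites` in the output of `redis-cli info`; the helper names `info_field` and `rewrite_happened_since` and the `REDIS_CLI` override variable are hypothetical and not part of this PR.

```shell
# Hypothetical helper: print the value of one field from INFO output.
# REDIS_CLI is an assumed override hook; it defaults to plain redis-cli.
info_field() {
    ${REDIS_CLI:-redis-cli} info | tr -d '\r' |
        awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Succeeds if aof_rewrites no longer equals the value sampled earlier.
rewrite_happened_since() {
    [ "$(info_field aof_rewrites)" != "$1" ]
}
```

A caller would sample `before=$(info_field aof_rewrites)`, do its work, and then treat a successful `rewrite_happened_since "$before"` as a signal to retry.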


Note, it was eventually decided not to include a backup script in the redis repository and instead document how to safely back up AOF files, see: redis/redis-doc#1794

Original top-comment before backup script was removed from this PR:

AOF backup script defined in #10063.

In Redis 7 we introduced multi-part AOFs (#9788) based on a manifest file that indicates which files are part of Redis's AOF persistence. Our updated documentation about backup tells the user to simply copy the directory where all the AOF files (and manifest) reside.

This is bad advice, because files in the directory might change while the copy is in progress. During a rewrite we might end up copying an old manifest file together with new base and increment files, and end up with a useless backup.

So this PR adds a backup script that performs the following:

  1. Verify the server isn't performing a rewrite.
  2. Create hard links to files in the directory.
  3. Verify no rewrite started or happened since (1). If one did, delete the hard links and go back to (1).
  4. Copy/gzip the hard links.
  5. Delete the hard links.
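
The five steps above can be sketched roughly as follows. This is not the script that was under review: the function name `safe_aof_backup`, its arguments, the `info_field` helper and the `REDIS_CLI` override variable are assumptions made for illustration. It relies only on the `aof_rewrites` and `aof_rewrite_in_progress` INFO fields mentioned in this PR.

```shell
# Sketch of steps 1-5. Assumed names: REDIS_CLI, info_field, safe_aof_backup.
info_field() {
    ${REDIS_CLI:-redis-cli} info | tr -d '\r' |
        awk -F: -v key="$1" '$1 == key { print $2 }'
}

# usage: safe_aof_backup <aof_dir> <links_dir> <archive.tar.gz>
# <links_dir> must be outside <aof_dir> but on the same filesystem.
safe_aof_backup() {
    aof_dir=$1
    links_dir=$2
    archive=$3
    while true; do
        # (1) sample the counter, then make sure no rewrite is running
        before=$(info_field aof_rewrites)
        [ -n "$before" ] || return 1   # server unreachable or field missing
        if [ "$(info_field aof_rewrite_in_progress)" = "1" ]; then
            sleep 1
            continue
        fi

        # (2) hard-link every file in the AOF directory
        mkdir -p "$links_dir" || return 1
        ln "$aof_dir"/* "$links_dir"/ || return 1

        # (3) keep the links only if no rewrite started or ran since (1)
        if [ "$(info_field aof_rewrite_in_progress)" != "1" ] &&
           [ "$(info_field aof_rewrites)" = "$before" ]; then
            break
        fi
        rm -rf "$links_dir"
    done

    # (4) archive the links, then (5) delete them
    tar -czf "$archive" -C "$links_dir" . || return 1
    rm -rf "$links_dir"
}
```

Hard links make step (2) nearly instantaneous, which keeps the window in which a rewrite can interfere small; the slow copy in step (4) then works on names that Redis no longer touches.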

Other changes:
Include stat counters for aof rewrites and bgsaves:
Useful in general but also practical if we want to make sure a certain operation was performed without any aof rewrite in the middle.

  • Fix the AOF backup doc PR accordingly.
  • Decide if we want to use a lock file, a mem flag, or just the existing auto-aof-rewrite-percentage config to disable rewrites (and fail the backup if a rewrite is in progress).
  • If we create a new way to abort and disable rewrites, decide if we want it to be a generic way to abort all forks coupled with auto-aof-rewrite-percentage, or a specific command for rewrites.

@yoav-steinberg yoav-steinberg added the state:needs-doc-pr requires a PR to redis-doc repository label Jan 25, 2022
@oranagra oranagra added this to Backlog in 7.0 via automation Jan 25, 2022
@oranagra oranagra moved this from Backlog to In progress in 7.0 Jan 25, 2022
@yoav-steinberg yoav-steinberg changed the title from "Stat counters for aof rewrites and bgsaves" to "AOF backup script" Jan 25, 2022
@yoav-steinberg yoav-steinberg marked this pull request as draft January 25, 2022 10:15
@yoav-steinberg yoav-steinberg marked this pull request as ready for review January 27, 2022 11:51
@oranagra oranagra linked an issue Jan 31, 2022 that may be closed by this pull request
Member

@oranagra oranagra left a comment

@yossigo my bash is not as fluent as my tcl 8-)
please review too.

utils/aof_backup.sh (outdated)
while true; do
aof_rewrites=$(get_info_field aof_rewrites)
if [ $(get_info_field "aof_rewrite_in_progress") -eq 1 ]; then
echo "Redis is performing an AOF rewrite, waiting..."
Member

this could be very verbose, maybe we should print just the first one? or add a silent mode?

Contributor Author

@yoav-steinberg yoav-steinberg Feb 3, 2022

I prefer not to complicate the script with "first time checks". Users can easily remove this or change this. I don't think it's that bad to have your terminal filled with these if you have 1800 seconds of a rewrite.

Member

i do, but we can leave it for Yossi to rule

Member

I agree with @oranagra, no need for a silent mode just a flag to say we've printed this message so let's not print it again as nothing changed.
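
A small sketch of the "print it once" flag suggested here, assuming a hypothetical `info_field` helper that reads `aof_rewrite_in_progress` from `redis-cli info`. The names `wait_for_rewrite_end`, `REDIS_CLI` and `POLL_SECONDS` are illustrative and do not come from the script under review.

```shell
# Hypothetical helper: print the value of one field from INFO output.
info_field() {
    ${REDIS_CLI:-redis-cli} info | tr -d '\r' |
        awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Wait until no AOF rewrite is running, announcing the wait only once.
wait_for_rewrite_end() {
    announced=0
    while [ "$(info_field aof_rewrite_in_progress)" = "1" ]; do
        if [ "$announced" -eq 0 ]; then
            echo "Redis is performing an AOF rewrite, waiting..."
            announced=1
        fi
        sleep "${POLL_SECONDS:-1}"
    done
}
```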

@oranagra oranagra changed the title from "AOF backup script" to "Safe multi-part AOF backup script" Feb 1, 2022
@yoav-steinberg yoav-steinberg added the release-notes indication that this issue needs to be mentioned in the release notes label Feb 1, 2022
@oranagra oranagra moved this from In progress to In Review in 7.0 Feb 3, 2022
yoav-steinberg and others added 2 commits February 3, 2022 17:23
Co-authored-by: Oran Agra <oran@redislabs.com>
Collaborator

@chenyang8094 chenyang8094 left a comment

Some questions:

  1. Do we need to support AOF backups when redis not running (When we don't specify -h -p option)
  2. The current implementation is to directly establish a soft link to appenddir, do we need to deal with the case where multiple redis share the same dir (they have the same appenddir), I mean we should strictly follow the manifest file instructions to backup AOF
  3. Do we need to be compatible with the macos ? for example, this code (readlink /proc/$pid/cwd) does not work in macos, but AFAIK probably many people will test this script on macos

@oranagra
Member

oranagra commented Feb 8, 2022

btw, can someone check if all this ln and /proc magic works on FreeBSD / MacOS?
@devnexen maybe you can take a look

Ohh @chenyang8094 i missed your above edit.
can you please suggest a readlink alternative that works?

@devnexen
Contributor

devnexen commented Feb 8, 2022

I would not rely on procfs: it does not exist on macOS and is optional on FreeBSD. There is the procstat command-line tool, e.g. procstat pargs <pid>

@yoav-steinberg
Contributor Author

@chenyang8094 thanks for the comments,

  1. Do we need to support AOF backups when redis not running (When we don't specify -h -p option)

I think the aim should be to provide a relatively simple script the user can modify or just read to understand how to safely back up their AOF. The goal I had in mind wasn't to provide a full-blown backup solution (we don't even do this for rdb). In that respect I think we need to show how to back up a running server, and if the user needs to back up a downed server they can figure it out from the script.

  2. The current implementation is to directly establish a soft link to appenddir, do we need to deal with the case where multiple redis share the same dir (they have the same appenddir), I mean we should strictly follow the manifest file instructions to backup AOF.

Right, I'll try to fix this. Thanks.

  3. Do we need to be compatible with the macos ? for example, this code (readlink /proc/$pid/cwd) does not work in macos, but AFAIK probably many people will test this script on macos

I think this isn't critical, again, this is only an example. But if there's a good way to support both (trying to avoid adding os checks), I'm for it.

@oranagra
Member

oranagra commented Feb 8, 2022

i'm not sure about treating this script just as an example. @yossigo please share your opinion.
i may be ok about ignoring a down server, but i'd rather make the effort to support popular Unixes (depends on the complexity).

regarding multiple servers using the same folder, one option is to parse the manifest file, the other option is to use the prefix / pattern matching. the difference is that it'll include history files, but i suppose that's ok.

@yoav-steinberg
Contributor Author

regarding multiple servers using the same folder, one option is to parse the manifest file, the other option is to use the prefix / pattern matching. the difference is that it'll include history files, but i suppose that's ok.

In any case I'll need the prefix in order to find the correct manifest file. I also don't want the complexity of parsing the file. I'll just back up based on the appendfilename.* prefix.

@chenyang8094
Collaborator

@yoav-steinberg If we distribute the script with the redis source code, I think it will not be an example, and many users will use it, like the redis-check-aof tool. So I think we need to take it more seriously.

- use `lsof` instead of procfs to find server's working dir.
- no support for `-p` relative dir in `mktemp`
- no support for file change warning suppression in `tar`.
Collaborator

@chenyang8094 chenyang8094 left a comment

.

exit 1
fi

if [ $(get_config appendonly) != yes ]; then
Collaborator

@chenyang8094 chenyang8094 Feb 8, 2022

If Redis once had AOF enabled (and created many AOF files) but disabled it later, i think we need to support this scenario.

Member

@yossigo yossigo left a comment

Added a few specific comments.

I think there are bigger issues around this script, as @chenyang8094 mentioned: we expect the server to be up, but what if it's down? or restarting? or changing configuration? Not sure we can cover all cases.

I think we should also consider other options where Redis is instructed to hold back a rewrite because a backup is in progress.

pid=$(get_info_field process_id)
appenddirname=$(get_config appenddirname)
appendfilename=$(get_config appendfilename).
working_dir=$(lsof -a -p $pid -d cwd -Fn | grep "^n" | sed "s/n\//\//")
Member

Why not get_config dir? I wouldn't assume lsof is always available.

Contributor Author

@yoav-steinberg yoav-steinberg Feb 9, 2022

redis dir is relative to the working directory. I need the working directory to know how to treat all other locations.
Couldn't find any good way to find the working directory on bsd/macos other than lsof. I can check the os and use procfs on linux. Or verify lsof exists.
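
One way the two options mentioned here could be combined, as a sketch: try procfs first and fall back to `lsof` only when it is installed. The function name `process_cwd` is hypothetical; the `lsof` invocation is the one quoted in the hunk above, with the leading `n` of its `-Fn` output stripped by `sed`.

```shell
# Hypothetical helper: print the working directory of process <pid>.
# Tries Linux procfs first, then falls back to lsof if it is installed.
process_cwd() {
    pid=$1
    if [ -e "/proc/$pid/cwd" ]; then
        readlink "/proc/$pid/cwd"
    elif command -v lsof >/dev/null 2>&1; then
        # -Fn emits the path on a line prefixed with "n"; strip that prefix.
        lsof -a -p "$pid" -d cwd -Fn 2>/dev/null | sed -n 's/^n//p'
    else
        echo "cannot determine cwd of $pid: no procfs and no lsof" >&2
        return 1
    fi
}
```

This avoids an explicit OS check, at the cost of failing on systems that have neither facility.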

appendfilename=$(get_config appendfilename).
working_dir=$(lsof -a -p $pid -d cwd -Fn | grep "^n" | sed "s/n\//\//")
appenddir_path=$working_dir/$appenddirname
if [ ! -d $appenddir_path ]; then
Member

Suggested change
- if [ ! -d $appenddir_path ]; then
+ if [ ! -d "$appenddir_path" ]; then

Better practice.

Member

This repeats in many places, I did not comment everywhere.

working_dir=$(lsof -a -p $pid -d cwd -Fn | grep "^n" | sed "s/n\//\//")
appenddir_path=$working_dir/$appenddirname
if [ ! -d $appenddir_path ]; then
echo "Couldn't find $appenddir_path. redis-server must be run locally. Aborting."
Member

This could also fail due to permissions.

appenddirname=$(get_config appenddirname)
appendfilename=$(get_config appendfilename).
working_dir=$(lsof -a -p $pid -d cwd -Fn | grep "^n" | sed "s/n\//\//")
appenddir_path=$working_dir/$appenddirname
Member

Suggested change
- appenddir_path=$working_dir/$appenddirname
+ appenddir_path="$working_dir/$appenddirname"

backup_path=$tmp_dir/$appenddirname
mkdir $backup_path

ln $appenddir_path/$appendfilename* $backup_path
Member

We should probably add an earlier check to make sure we're not crossing filesystem boundaries here, because links won't work in that case.

Member

IIUC we made sure to create the temp folder inside dir, so shouldn't that be covered?

Contributor Author

I think it's covered now except if someone mounts another fs under dir which is a case I'm willing to ignore.
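
An early check of the kind discussed in this thread could look like the sketch below. It is a heuristic, not what the script did: it compares the device name and mount point that POSIX `df -P` reports for each path (mount points containing spaces would be compared only by their last word). The names `fs_of` and `same_filesystem` are hypothetical.

```shell
# Hypothetical helper: print "<device> <mount point>" for the given path.
fs_of() {
    df -P "$1" 2>/dev/null | awk 'NR == 2 { print $1, $NF }'
}

# Succeeds if both paths appear to live on the same mounted filesystem,
# which is what hard links between them require.
same_filesystem() {
    fs_a=$(fs_of "$1")
    fs_b=$(fs_of "$2")
    [ -n "$fs_a" ] && [ "$fs_a" = "$fs_b" ]
}
```

A script could call `same_filesystem "$appenddir_path" "$tmp_dir"` before running `ln` and abort with a clear message when it fails.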

@yoav-steinberg
Contributor Author

I think there are bigger issues around this script, as @chenyang8094 mentioned: we expect the server to be up, but what if it's down? or restarting? or changing configuration? Not sure we can cover all cases.

I think we should also consider other options where Redis is instructed to hold back a rewrite because a backup is in progress.

@oranagra @yossigo I'm getting the feeling I'm slowly being nudged to create a full-blown backup solution for AOF in bash. I originally proposed we drop official support for AOF backups because it's pretty sketchy whether this makes sense given Redis's RDB feature. I'd like us to have a talk about what's the aim of this script... Once we pass a certain level of complexity I think we should start thinking of some export feature in Redis and not a script.

@yoav-steinberg
Contributor Author

Had a talk with @yossigo & @oranagra relating to above concerns:

  1. The script is complex and requires redis-cli and other utilities in order to run.
  2. We'll need to maintain the script in the future and we'll probably need to fix stuff like OS support or other usage configurations we haven't thought of.
  3. We'd like a solution that's easy to document and explain to users and will also work if the server is down or restarts during the process.

@yossigo suggested we create a mechanism for disabling/aborting the rewrite process and persisting this setting between restarts: create an appendonly.lock file in the appendonlydir. Redis will never perform rewrites if this file is present. We'll add a command REWRITEAOF_LOCK ON|OFF which, on ON, creates the lock file and aborts any running aof rewrite, and on OFF deletes the lock file.

After the server is protected against doing rewrites we can safely create hard links/copy the dir or whatever we want. Then we can delete the lock file or call the command with OFF.

We'll document this mechanism and its logic so users can create their own backup scripts/logic instead of having to maintain a script that handles all edge cases.

@chenyang8094
Collaborator

chenyang8094 commented Feb 10, 2022

I don't think it's a good suggestion to create appendonly.lock files for the following reasons:

  1. AOFRW has to perform a file system call to check if the file exists before executing
  2. Once we or the user forget to delete the appendonly.lock file, we will have an increasing AOF
  3. If the user accidentally copies and backs up appendonly.lock together, when a redis uses this backup in the future, it will not be able to execute AOFRW.

@oranagra
Member

maybe the lock file should be in dir and not in appendonlydir? (will reduce the chance someone copies it).
or maybe we can relax the concern about server restart, and just keep this flag in memory.

@yoav-steinberg
Contributor Author

maybe the lock file should be in dir and not in appendonlydir? (will reduce the chance someone copies it).

I'm afraid we need it in appendonlydir to make sure it doesn't conflict with any other Redis instance. It'll need to be under <dir>/<appendonlydir>/<appendonlyname>.lock. I think this is the only way to make sure it won't conflict with other Redises.

or maybe we can relax the concern about server restart, and just keep this flag in memory.

We wanted to make sure we have a solution for these cases. So I'm not sure we want to back down from that now. What we can do is print a big warning if Redis starts with this in the directory, or even fail to start in such a case.

@yossigo
Member

yossigo commented Feb 10, 2022

I think keeping the flag in memory might be enough. A server that restarts and immediately rewrites and completes the rewrite before creating links of all files is possible, but very unlikely.

@chenyang8094
Collaborator

chenyang8094 commented Feb 11, 2022

I think we can keep the changes in stat_aof_rewrites and stat_rdb_saves in this PR, they do work in reality and give us the opportunity to safely back up multi part aof.

Also, if we don't plan to maintain and improve this backup script, we shouldn't release it in the redis source code, instead, we can put it in the redis documentation as a suggestion or an example for users to refer to (the only uncertainty is how many people will actually see it).

Another suggestion is that we abandon the use of scripts and instead directly implement the C code tool of redis-backup-aof (similar to redis-check-aof), so that we can abandon the dependence on tools such as redis-cli, so it can be distributed in redis code and be an AOF tool out of the box.

p.s. If we want to temporarily disable AOFRW, we can set auto-aof-rewrite-percentage to 0, which will disable auto-triggered AOFRW (I believe users don't intentionally trigger AOFRW manually when backing up). And If we only want to know whether AOFRW occurred during the backup process, we also can detect whether the content of the manifest file has changed (for example, create a manifest MD5 when starting the backup)
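
The manifest-fingerprint idea could be sketched as below. The comment suggests MD5; POSIX `cksum` is swapped in here only because it is available without extra tools on most systems, so treat that choice as an assumption. The function names are hypothetical.

```shell
# Hypothetical helper: print a fingerprint ("<crc> <size>") of the manifest.
manifest_fingerprint() {
    cksum < "$1"
}

# usage: manifest_unchanged <manifest_path> <fingerprint_taken_earlier>
# Succeeds if the manifest content still matches the earlier fingerprint.
manifest_unchanged() {
    [ "$(manifest_fingerprint "$1")" = "$2" ]
}
```

A backup would take the fingerprint before linking or copying files and discard the result if `manifest_unchanged` fails afterwards.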

@yoav-steinberg
Contributor Author

I think we can keep the changes in stat_aof_rewrites and stat_rdb_saves in this PR, they do work in reality and give us the opportunity to safely back up multi part aof.

👍 I'll change this PR to only include the stats and update the top comment.

Also, if we don't plan to maintain and improve this backup script, we shouldn't release it in the redis source code, instead, we can put it in the redis documentation as a suggestion or an example for users to refer to (the only uncertainty is how many people will actually see it).

👍 I agree. So given the intention to make the mechanism simpler by adding a block rewrite mechanism, we'll just document this and forget about the script.

Another suggestion is that we abandon the use of scripts and instead directly implement the C code tool of redis-backup-aof (similar to redis-check-aof), so that we can abandon the dependence on tools such as redis-cli, so it can be distributed in redis code and be an AOF tool out of the box.

I think this is too complex and will hide the internals of the mechanism which we want (advanced) users to be able to understand and customize for their own needs.

p.s. If we want to temporarily disable AOFRW, we can set auto-aof-rewrite-percentage to 0, which will disable auto-triggered AOFRW (I believe users don't intentionally trigger AOFRW manually when backing up). And If we only want to know whether AOFRW occurred during the backup process, we also can detect whether the content of the manifest file has changed (for example, create a manifest MD5 when starting the backup)

This is a decent option, but bear in mind that changing auto-aof-rewrite-percentage to 0 won't abort a running rewrite. So adding a mechanism for this will make performing backups simpler (no need to check if there's a rewrite after disabling rewrites).
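
A sketch of the config-based option, including the extra wait this reply points out: automatic rewrites are disabled with `CONFIG SET auto-aof-rewrite-percentage 0`, then any rewrite already in flight is waited for before the backup command runs. The wrapper name `with_auto_rewrite_disabled`, the `info_field` helper and the `REDIS_CLI` variable are hypothetical. A manually triggered BGREWRITEAOF is not prevented by this.

```shell
# Hypothetical helper: print the value of one field from INFO output.
info_field() {
    ${REDIS_CLI:-redis-cli} info | tr -d '\r' |
        awk -F: -v key="$1" '$1 == key { print $2 }'
}

# usage: with_auto_rewrite_disabled <command> [args...]
with_auto_rewrite_disabled() {
    cli=${REDIS_CLI:-redis-cli}
    # CONFIG GET prints the name on one line and the value on the next.
    saved=$($cli config get auto-aof-rewrite-percentage | tr -d '\r' | sed -n 2p)
    [ -n "$saved" ] || return 1
    $cli config set auto-aof-rewrite-percentage 0 >/dev/null || return 1
    # Setting 0 does not abort a rewrite that is already running, so wait.
    while [ "$(info_field aof_rewrite_in_progress)" = "1" ]; do
        sleep 1
    done
    "$@"
    status=$?
    # Restore the previous setting whether or not the command succeeded.
    $cli config set auto-aof-rewrite-percentage "$saved" >/dev/null
    return "$status"
}
```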

@yoav-steinberg yoav-steinberg changed the title from "Safe multi-part AOF backup script" to "aof rewrite and rdb save counters in info" Feb 14, 2022
Member

@oranagra oranagra left a comment

@redis/core-team please approve two new counters in INFO stats.

@oranagra oranagra moved this from In Review to Awaits merge in 7.0 Feb 16, 2022
@oranagra oranagra merged commit 56fa48f into redis:unstable Feb 17, 2022
@oranagra oranagra added the state:major-decision Requires core team consensus label Feb 17, 2022
@oranagra oranagra moved this from Awaits merge to Unreleased in 7.0 Feb 17, 2022
ranshid pushed a commit to ranshid/redis that referenced this pull request Feb 20, 2022
Add aof_rewrites and rdb_snapshots counters to info.
This is useful to figure out if a rewrite or snapshot happened since last check.
This was part of the (ongoing) effort to provide a safe backup solution for multipart-aof backups.
@oranagra oranagra mentioned this pull request Feb 28, 2022
@oranagra oranagra moved this from Unreleased to Done in 7.0 Mar 1, 2022
@@ -1471,6 +1471,7 @@ int rdbSaveBackground(int req, char *filename, rdbSaveInfo *rsi) {
pid_t childpid;

if (hasActiveChildProcess()) return C_ERR;
server.stat_rdb_saves++;
Collaborator

@enjoy-binbin enjoy-binbin Jun 7, 2022

i notice we only do server.stat_rdb_saves++ in rdbSaveBackground
and did not do it in SAVE command, (or in FLUSHALL it may trigger rdbsave) (or other)

I think it should be counted in the save command?
or maybe we should put it near server.lastsave?

Member

i agree that we should increment it in the SAVE command too, but i'm not sure about the others.

  1. i think there's value (or at least it's very different) setting it before we fork, rather than at completion (in which case we need to decide if we count failures too).
  2. i don't think the implicit save of FLUSHALL should count, arguably it's the same as deleting the old persistence file, rather than saving an empty one.
  3. i don't think we should count the empty rdb we save when creating an empty base for a multipart AOF when starting up empty.

Collaborator

yes, the one in SAVE command is my first thinking, and then I looked at the others (mention in case we forget).
as long as we agree with the stat field meaning, just a random look
i will make a PR with the SAVE one

enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jun 7, 2022
Currently, we only increment stat_rdb_saves in rdbSaveBackground,
we should also increment it in the SAVE command.

The stat counter was introduced in redis#10178
oranagra pushed a commit that referenced this pull request Jun 7, 2022
Currently, we only increment stat_rdb_saves in rdbSaveBackground,
we should also increment it in the SAVE command.

We concluded there's no need to increment when:
1. saving a base file for an AOF
2. when saving an empty rdb file to delete an old one
3. when saving to sockets (not creating a persistence / snapshot file)

The stat counter was introduced in #10178

* fix a wrong comment in startSaving
Mixficsol pushed a commit to Mixficsol/redis that referenced this pull request Apr 12, 2023
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
Labels
release-notes indication that this issue needs to be mentioned in the release notes state:major-decision Requires core team consensus state:needs-doc-pr requires a PR to redis-doc repository
Projects
Archived in project
7.0
Done
Development

Successfully merging this pull request may close these issues.

backing up multi-prart-aof
7 participants