Permalink
Fetching contributors…
Cannot retrieve contributors at this time
412 lines (336 sloc) 21.7 KB
/*
* 2013+ Copyright (c) Kirill Smorodinnikov <shaitkir@gmail.com>
*
* This file is part of Elliptics.
*
* Elliptics is free software: you can redistribute it and/or modify
* it under the terms of the GNU Lesser General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Elliptics is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public License
* along with Elliptics. If not, see <http://www.gnu.org/licenses/>.
*/
/*!
\page recovery.html Recovery
\brief Synchronization data within one group and between different groups
\tableofcontents
\section introduction Introduction
Recovery is a process of restoring data consistency within one group and/or between groups.
Recovery supports 2 modes: `merge` and `dc`. Each of them solves his problem.
- `merge` mode restores consistency within one group and redistributes data between all nodes
- `dc` mode restores consistency between groups (replicas)
Main tool for recovering data is dnet_recovery. It is distributed with elliptics-client package.
Usually recovery consists of number of parallel steps:
- determination of which key ranges needed to be recovered and from which nodes
- running iterators on those nodes for particular ranges
- computation of difference between local and remote iterator results
- and finally recovery of resulting difference
This process somehow varies between `merge` and `dc`.
but this still describes it enogh for high-level overview.
\section options Recovery options
dnet_recovery has follow options:
- <b>-r ELLIPTICS_REMOTE, --remote=ELLIPTICS_REMOTE</b> - elliptics node address.
Entrypoint into elliptics cluster. It has to be in Elliptics format adress:port:family.
On Linux IPv4 family specified with 2 and IPv6 with 10.
This is mandatory option which should be always specified.
Most likely dnet_recovery will be run via cron with -r $(hostname -f):1025:2.
<b>-r/--remote</b> can be used multiple times - one per node. More number of remotes can help with gathering full route list.
- <b>-g ELLIPTICS_GROUPS, --groups=ELLIPTICS_GROUPS</b> - Comma separated list of groups.
This is mandatory option which should be always specified.
- <b>-n NPROCESS, --nprocess=NPROCESS</b> - Number of subprocesses. This parameter influence on recovery speed.
Each subprocess will parallel iterate, sort iteration results and recover object from one node.
Former should be set to number of disks in RAID for IO-bound workloads and to number of CPUs
for cpu-bound workloads (of cause if system needs to be able to serve requests at the time
of recovery this option should be bounded even more). <b>[default: 1]</b>
- <b>-o ELLIPTICS_REMOTE, --one-node=ELLIPTICS_REMOTE</b> - recovery will iterates only specified node, so only it's keys will be recovered.
Nodes provided by <b>-r/--remote</b> will be used for gathering full route list.
<b> -i BACKEND_ID, --backend_id=BACKEND_ID - Specifies backend data on which should be recovered. IT WORKS ONLY WITH --one-node
- <b>-b BATCH_SIZE, --batch-size=BATCH_SIZE</b> - Number of keys which will be read/write/remove at once by one subprocess.
While bigger values speed up recovery they also consume more memory (memory consumption can be computed as avg_record_size * batch size * 2). <b>[default: 1024]</b>
- <b>-t TIMESTAMP, --time=TIMESTAMP</b> - Recover keys modified since `time`. Can be specified
as timestamp or as time difference e.g.: `1368940603`, `12h`, `1d`, or `4w`.
Latter is useful for cron jobs - for example one can put a cron job to recover only keys
modified within last month running each week - this will have pretty decent speed but also most likely be equivalent to full recovery.
- <b>-D DIR, --dir=DIR</b> - Temporary directory for iterators' results, statistics and other temporary files
By default script uses /var/tmp/dnet_recovery_%MODE% temporary directory.
- <b>-l FILE, --log=FILE</b> - Output log messages from library to file <b>[default: dnet_recovery.log]</b>
- <b>-L ELLIPTICS_LOG_LEVEL, --log-level=ELLIPTICS_LOG_LEVEL</b> - Elliptics client verbosity <b>[default: 1]</b>
- <b>-k LOCKFILE, --lock=LOCKFILE</b> - Lock file used for recovery <b>[default: dnet_recovery.lock]</b>
By default only one instance of any recovery type can be run.
- <b>-N, --dry-run</b> - Enable test mode: only count diffs without recovering. No data will be read or written or removed.
- <b>-s STAT, --stat=STAT</b> - Statistics output format: none/text <b>[default: text]</b>
- <b>-S, --safe</b> - Do not remove recovered keys after merge.
- <b>-e, --no-exit</b> - Will be waiting for user input at the finish.
- <b>-m MONITOR_PORT, --monitor-port=MONITOR_PORT</b> - Enable remote monitoring on provided port.
- <b>-w WAIT_TIMEOUT, --wait-timeout=WAIT_TIMEOUT</b> - Wait timeout for elliptics operations <b>[default: 3600]</b>
- <b>-a ATTEMPTS, --attemps=ATTEMPTS</b> - Number of attempts to recover one key
- <b>-c CHUNK_SIZE, --chunk-size=CHUNK_SIZE</b> - Size of chunk by which all object will be read and recovered <b>[default: 1048576]</b>
- <b>-C CUSTOM_RECOVER, --custom-recover=CUSTOM_RECOVER</b> - Sets custom recover app
that will be used by `dc` for recovering data.
- <b>-f DUMP_FILE, --dump-file=DUMP_FILE</b> - Sets dump file which contains hex ids of object that should be recovered.
Recovery instead of scanning all existing keys will check and recover only keys from dump file.
\section statistics Common approach to statistics
Recovery stats has several section each describes some stage of recovering except main secion. All counters in stats has 3 values:
- <b>some_counter_success</b> - count of succeeded operations or number of keys which was successfully read or written
- <b>some_counter_failed</b> - count of failed operations or number of keys which hasn't been read or written
- <b>some_counter_total = counter_success + counter_failed</b>
For simplifing reading statistics if `some_counter_failed` value is zero it will be hided from output statistics,
also `some_counter_total` will be hided and `some_counter_success` will be shortened to `some_counter`.
Main section has title “monitor” and contains information about all recovery:
- <b>main_started</b> - when recovery was started
- <b>main_finished</b> - when recovery was finished
Other statistics has mode-relative nature and will be described later.
\section merge Merge recovery
Mege recovery restores data consistency within one group.
The reason for that could be hardware failure or extending cluster by adding empty nodes to the group.
\subsection merge_algo How merge recovery works
For each nodes merge recovery makes follow steps:
-# from route list determines which ranges don't covered by the node
-# run iterator on the node and collect keys with metadata that shouldn't be on the node
-# goes throw iterated keys and dumps them.
Those dump files could be used for resuming recovery after some issues.
Dump files has name 'dump_{ip}' where {ip} is ip of iterated node and
are placed in tmp_dir specified by <b>-D DIR, --dir=DIR</b>.
-# for each collected key:
-# make lookup to the proper node
-# compare timestamp and size of key on proper node (if the key exists on proper node)
-# if it is necessary copy key from origin node to proper node
-# remove key from origin node (this step can be skipped if run dnet_recovery with '-S' option)
By option '<b>-o, --one-node</b>' recovery can be limited to iterate only one specified node.
It means that only node specified by <b>-o/--one-node</b> will be iterated and only its keys which shouldn't be on it will be moved to the proper node.
By combining options '<b>-o, --one-node</b>', '<b>-i, --backend-id</b>' recovery can be limited to iterate only one specified backend of one specified node.
It means that only backend specified by '<b>-i, --backend-id</b>' on node specified by <b>-o/--one-node</b> will be iterated and only its keys which shouldn't be on it will be moved to the proper node.
By default, merge will process all nodes from groups specified by <b>-g/--groups</b>.
If merge recovery runs with <b>-f/--dump-file</b> merge will make follow steps:
-# for each group from <b>-g/--groups</b>:
-# for each key from dump file:
-# try to read metadata of key from all nodes in groups
-# from all read metadata finds out replica with max timestamp and max size
-# check if proper node already has newer version of key
-# if it has not -> copy newer version to proper node
-# remove key from non-proper nodes
\subsection merge_statistics Merge statistics
- <b>group_%group%</b> section:
- <b>group_started</b> = time when processing group was started
- <b>started_finished</b> = time spent on processing group started
- <b>node_%ip%:%port%:%family% %group%</b> section:
- <b>filtered_keys</b> - number of keys that were selected by iterator
- <b>iterated_keys</b> - number of keys that were processed by iterator
- <b>total_keys</b> - total number of keys on the node
- <b>iterations</b> - number of iterations
- <b>local_read_bytes</b> - number of bytes that was read from the node
- <b>local_reads</b> - number of read operations
- <b>local_remove_retries</b> - number of extra retries to remove object
- <b>local_removed_bytes</b> - number of bytes that was removed from the node
- <b>local_removes</b> - number of removes from the node
- <b>local_removes_old</b> - number of object that was just removed from the node.
- <b>remote_lookups</b> - number of remote lookups
- <b>remote_writes</b> - number of remote writes
- <b>process_started</b> - time when processing node was started
- <b>process_started...iterate</b> - time spent on preparation
- <b>process_iterate...recover</b> - time spent on iteration and sorting results
- <b>process_recover...finished</b> - time spent on recovering (reading/writting/removing) objects
\subsection merge_logs Merge logs
\section dc DC recovery
DC recovery restores data consistency between groups (replicas). It copies key/data from replicas
that contains new version of object to replicas with outdated or missed keys.
DC recovery also recovers indexes by merging index shards from different groups.
DC recovery provides interface for using custom user-wrote recovery for user data when it is necessary.
User can use index recovering as an example for writting smart recovery for own format of data.
By option '<b>-o, --one-node</b>' recovery can be limited to process only one specified node ranges.
It means that only ranges of node specified by <b>-o/--one-node</b> will be iterated and only its keys from all groups will be compared and recovered.
By combining options '<b>-o, --one-node</b>', '<b>-i, --backend-id</b>' recovery can be limited to process only ranges of one specified backend of one specified node.
It means that only ranges from backend specified by '<b>-i, --backend-id</b>' on node specified by <b>-o/--one-node</b> will be iterated and only its keys from all groups will be compared and recovered.
By default, merge will process all nodes from groups specified by <b>-g/--groups</b>.
\subsection dc_algo How DC recovery works
For specified node `dc` recovery makes follow steps:
-# from route list determines which ranges are covered by the node
-# for all found ranges determines nodes from other groups (specified by -g)
and intersects ranges for origin node and nodes from other groups.
Thus we determines number of small ranges that are covered by origin node and
for each range keeps which node from other groups covers it
-# run iterator on collected node for determined ranges
-# goes throw iterated results and fill merged result:
for each keys it saves key with information on which node and with which timestamp/size/user_flags it exists.
-# goes throw merged results and dumps all iterated keys.
This dump file could be used for resuming recovery after some issues.
Dump file has name 'dump' and is placed in tmp_dir specified by <b>-D DIR, --dir=DIR</b>.
-# distributes keys among buckets:
one bucket with name 'rest_keys' will contain keys which can't be recovered via server-send
number of group buckets with name 'bucket_xx' ('xx' is group_id) will contain keys
which can be recovered via server-send from corresponding group.
-# runs custom recovery (or built-in implementation of it) against merged results.
If dc was run with option <b>-f/--dump-file</b> it will make follow steps for each key from dump file:
-# looks up for metadata to all specified groups
-# saves this infos to merged file (like in common case)
-# runs custom recovery (or built-in implementation of it) against merged results.
Built-in implementation of custom recovery for each key from merged results consists of two steps: server-send recovery with following rest-keys recovery.\n
If key could not be recovered via server-send, then it will be processed during rest-keys recovery.\n
Server-send recovery:
-# while there are non-empty buckets:
-# get next bucket - file that contains newest keys from a specific group
-# iterate over bucket keys by batch:
-# for each key in the batch:
-# check that key can be recovered via server-send: key size is less than chunk size
-# otherwise, move key to 'rest_keys'
-# divide batch into multiple parts, where one part contains keys that should be recovered to the specific combination of destination groups:
-# for each part call server-send and check server-send results:
-# if server-send failed to send some keys:
-# with timeout error - retry to send them max attempts times
-# with other errors - move them into the buckets from which those keys can be recovered
if there is no such bucket, move these keys into 'rest_keys'
Rest-keys recovery:
-# finds node that has newest version
-# reads key from that node
-# check if data for the key is an index file
-# if it is index:
-# reads all version of the object from other groups
-# merges data from all files
-# writes merged index to all groups
-# if it is regular object:
-# writes object to the groups which has older or missed object
\subsection dc_statistics DC statistics
- <b>monitor</b> section:
- <b>main_started...transpose</b> - time spent on iterating, sorting iterator results
- <b>main_transpose...merge</b> - time spent on transposing results
- <b>main_merge...filter</b> - time spent on merging results
- <b>main_filter...finished</b> - time spent on filtering and recovering results
- <b>iterate_%ip%:%port%:%family% %group%</b> section:
- <b>filtered_keys</b> - number of keys that were selected by iterator
- <b>iterated_keys</b> - number of keys that were processed by iterator
- <b>total_keys</b> - total number of keys on the node
- <b>iterations</b> - number of iterations
- <b>process_started</b> - time when node iteration was started
- <b>process_started...iterate</b> - time spent on preparation
- <b>process_iterate...sort</b> - time spent on iteration
- <b>process_sort...finished</b> - time spent on sorting iteration results
- <b>process_finished</b> - time when iteration&sort was finished
- <b>recover</b> section:
- <b>read_bytes</b> - number of bytes that was read
- <b>reads</b> - number of reads that was made
- <b>writes</b> - number of writes that was made
- <b>written_bytesM</b> - number of bytes that was written
\subsection dc_logs DC logs
\section examples Examples
Let our elliptics cluster consists of 3 groups and 3 nodes per group.
For more detail let:\n
group 1 has nodes:
\code
host_1_1:1025:2
host_1_2:1025:2
host_1_3:1025:2
\endcode
group 2 has nodes:
\code
host_2_1:1025:2
host_2_2:1025:2
host_2_3:1025:2
\endcode
group 3 has nodes:
\code
host_3_1:1025:2
host_3_2:1025:2
host_3_3:1025:2
\endcode
\subsection merge_hw_failure Merge recovering after hardware failures
Once problem with hardware was occured at one of the nodes from our cluster.
Some time one group (let it be group #1) was working without one node.
For this time serviceable nodes was responding for key from problematic node.
When this node be restored it will have some keys outdated or missed.
To deal with such situation use `merge` recovery on all serviceable nodes from problematic groups.
Let host_1_2:1025:2 be the node with hardware issue. For synchronization data within group 1 after
restoring node host_1_2:1025:2 dnet_recovery should be used with follow parameters:
\code
dnet_recovery merge -r host_1_1:1025:2 -g 1
\endcode
or
\code
dnet_recovery merge -o host_1_1:1025:2 -g 1
dnet_recovery merge -o host_1_3:1025:2 -g 1
\endcode
Second way allows to run several dnet_recovery in parallel
and run each of them directly near the node with which it will process.
If hardware issues were occured at the several groups (let it be 1 and 2), use follow parameters:
\code
dnet_recovery merge -r host_1_1:1025:2 -g 1,2
\endcode
\subsection merge_new_nodes Merge recovering after adding new empty nodes
Once we decided that current cluster capacity is not enough and we should add several nodes to group #3.
After configurating and starting new nodes we need to move keys from old nodes to new ones.
For this we should use `merge` recovery with follow parameters:
\code
dnet_recovery merge -r host_1_1:1025:2 -g 1
\endcode
or
\code
dnet_recovery merge -o host_3_1:1025:2 -g 1
dnet_recovery merge -o host_2_1:1025:2 -g 1
dnet_recovery merge -o host_2_1:1025:2 -g 1
\endcode
Second way allows to run several dnet_recovery in parallel
and run each of them directly near the node with which it will process.
If serveral groups (let it be 1 and 3) have to be enlarged by adding new nodes, use follow parameters:
\code
dnet_recovery merge -r host_1_1:1025:2 -g 1,3
\endcode
\subsection dc_nhw_failure DC recovering after network or hardware failures
Lets all 3 groups of our cluster are located in different 3 DataCenters.
For some time connection with one of DataCenter (elliptics group) has been lost.
After restoring connection data in this replic can be outdated and/or missed.
Let group 2 be problematic group. For restoring consistency between replicas use `dc` recovery with follow parameters:
\code
dnet_recovery dc -r host_2_1:1025:2 -g 1,2,3
dnet_recovery dc -r host_2_2:1025:2 -g 1,2,3
dnet_recovery dc -r host_2_3:1025:2 -g 1,2,3
\endcode
If the timestamp of issue is known that dnet_recovery can check only keys changed
from this timestamp by using -t dnet_recovery option.
\subsection dc_dump_failure DC recovering from dump file after some failures
As a result of scanning logs we have found that some keys are missed/outdated in some groups.
Grep hex keys of all found keys and write it to dump file like follow:
\code{dump_file}
1f40fc92da241694750979ee6cf582f2d5d7d28e18335de05abc54d0560e0f5302860c652bf08d560252aa5e74210546f369fbbbce8c12cfc7957b2652fe9a75
5267768822ee624d48fce15ec5ca79cbd602cb7f4c2157a516556991f22ef8c7b5ef7b18d1ff41c59370efb0858651d44a936c11b7b144c48fe04df3c6a3e8da
acc28db2beb7b42baa1cb0243d401ccb4e3fce44d7b02879a52799aadff541522d8822598b2fa664f9d5156c00c924805d75c3868bd56c2acb81d37e98e35adc
....
5ae625665f3e0bd0a065ed07a41989e4025b79d13930a2a8c57d6b4325226707d956a082d1e91b4d96a793562df98fd03c9dcf743c9c7b4e3055d4f9f09ba015
\endcode
For synchronization only these keys from all groups (replicas) use `dc` recovery with follow parameters:
\code
dnet_recovery dc -r host_1_1:1025:2 -g 1,2,3 -f /path/to/dump_file
\endcode
\subsection merge_dump_failure Merge recovering from dump file after some failures
As a result of scanning logs we have found that some keys are missed/outdated in some groups.
Grep hex keys of all found keys and write it to dump file like follow:
\code{dump_file}
1f40fc92da241694750979ee6cf582f2d5d7d28e18335de05abc54d0560e0f5302860c652bf08d560252aa5e74210546f369fbbbce8c12cfc7957b2652fe9a75
5267768822ee624d48fce15ec5ca79cbd602cb7f4c2157a516556991f22ef8c7b5ef7b18d1ff41c59370efb0858651d44a936c11b7b144c48fe04df3c6a3e8da
acc28db2beb7b42baa1cb0243d401ccb4e3fce44d7b02879a52799aadff541522d8822598b2fa664f9d5156c00c924805d75c3868bd56c2acb81d37e98e35adc
....
5ae625665f3e0bd0a065ed07a41989e4025b79d13930a2a8c57d6b4325226707d956a082d1e91b4d96a793562df98fd03c9dcf743c9c7b4e3055d4f9f09ba015
\endcode
For synchronization only these keys within group 1 (replicas) use `merge` recovery with follow parameters:
\code
dnet_recovery merge -r host_1_1:1025:2 -g 1 -f /path/to/dump_file
\endcode
\subsection resuming_recovery Resuming recovery after some issues
Both `dc` and `merge` recovery dumps keys that was found after node iteration.
`dc` dump is a one dump file with all keys from all iterated ranges from all iterated nodes and groups.
`merge` dump is a number of dump files. One per iterated node.
This dump files could be used for resuming recovery after some issues that was occurred while recovering.
For resuming recovery run dnet_recovery with <b>-f DUMP_FILE, --dump-file=DUMP_FILE</b> and
path to dump file that was generated by previous run of dnet_recovery:
\code
dnet_recovery dc -r host_1_1:1025:2 -g 1,2,3 -f /path/to/dump_file
\endcode
or
\code
dnet_recovery merge -r host_1_1:1025:2 -g 1 -f /path/to/dump_file_host_1_1\:1025\:2\ 1
\endcode
*/