
sheep: avoid diskfull caused by recovery process #185

Merged: 1 commit merged into master on Apr 27, 2016

Conversation

mitake commented Oct 28, 2015

sheep can corrupt its cluster when the recovery process causes disk
full. To avoid this problem, this patch adds a new option -F to sheep.
If this option is passed to the sheep process, every sheep process in
the cluster stops itself if there is a possibility of disk full during
recovery.

Fixes #59

Signed-off-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>

mitake commented Oct 28, 2015

TODO: another mode that simply skips recovery and keeps the cluster running. However, it would allow a low-redundancy situation and is a little bit dangerous.

@vtolstov

@mitake thanks, but I prefer the second suggestion, keeping the cluster running, because the node may already have data and provide redundancy for some VDIs.

mitake commented Oct 28, 2015

@vtolstov sure, we'll provide the option later

sirio81 commented Nov 3, 2015

2015-10-28 6:51 GMT+01:00 Hitoshi Mitake notifications@github.com:

> If this command is passed to the sheep process, every sheep process of
> the cluster stops itself if there is a possibility of diskfull during
> recovery.

Hi, thank you for the patch, but I wonder: if the cluster stops (during
recovery), what do I do next? I mean, what should I do to get the cluster
back up and running?

mitake commented Nov 5, 2015

@sirio81 on second thought, a shutdown -> reboot will cause the same disk-full problem. I'll update this patch to just skip recovery instead. Thanks a lot for your comment.

mitake commented Nov 24, 2015

@vtolstov @sirio81 I updated this PR to just skip recovery instead of stopping the sheep processes. Could you try it?

@vtolstov

I'll try it, but why not do this by default (avoid disk full and not shut down sheep)?

mitake commented Dec 23, 2015

The new behavior of avoiding dangerous recovery must be shared by all sheep processes in a cluster, so I moved the new -F option to the dog cluster format command.

sirio81 commented Dec 23, 2015

2015-12-23 13:45 GMT+01:00 Hitoshi Mitake notifications@github.com:

> The new behavior of avoiding dangerous recovery must be shared by all
> sheeps in a cluster. So I moved the new -F option to dog cluster format
> command.

It's not on master yet, right?

mitake commented Dec 23, 2015

@sirio81 yes, this PR is waiting for testing.

mitake added this to the v1.0 milestone Feb 29, 2016
mitake self-assigned this Feb 29, 2016
* about space consumption by metadata objects
* e.g. inode, ledger
*/
if (is_data_obj(oids[j]))

tmenjo commented Apr 25, 2016

I think this should be if (!is_data_obj(oids[j])).
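
For reference, a minimal sketch of the per-node space estimation with the corrected condition, i.e. skipping non-data objects and counting data objects at a fixed size. The helper, the 4 MB constant, and the loop structure below are simplified stand-ins for illustration, not the code in this patch:

```c
#include <stdbool.h>
#include <stdint.h>

#define DATA_OBJ_SIZE (UINT64_C(1) << 22)       /* assumed fixed 4 MB data object */

static bool is_data_obj_stub(uint64_t oid)
{
        /* stand-in: the real sheepdog helper inspects the object-type bits */
        return (oid >> 62) == 0;
}

/* Sum the space the data objects in 'oids' would need on one node. */
static uint64_t required_space_for_node(const uint64_t *oids, int nr_oids)
{
        uint64_t required = 0;

        for (int j = 0; j < nr_oids; j++) {
                /*
                 * The corrected condition: skip metadata objects
                 * (inode, ledger, ...) and count only data objects.
                 */
                if (!is_data_obj_stub(oids[j]))
                        continue;

                required += DATA_OBJ_SIZE;
        }

        return required;
}
```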

sheep can corrupt its cluster when the recovery process causes disk
full. To avoid this problem, this patch adds a new option -F to dog
cluster format. If this option is passed during cluster formatting,
every sheep process of the cluster skips recovery if there is a
possibility of disk full during recovery.

Fixes #59

Signed-off-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
mitake commented Apr 25, 2016

@tmenjo thanks for your review. I fixed the invalid condition in the object size calculation and the double counting. Could you review it (not tested yet...)?

free(key);
continue;
}
rb_insert(&seen_objects, key, node, seen_object_cmp);
LGTM 👍
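
For readers unfamiliar with the double-counting fix mentioned above: the patch records object IDs it has already accounted for (the seen_objects tree populated with rb_insert in the excerpt) and skips duplicates. Below is a self-contained sketch of the same idea, using a sorted copy of the ID array instead of an rb-tree purely to keep it short; the names are illustrative, not the patch's:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int oid_cmp(const void *a, const void *b)
{
        uint64_t x = *(const uint64_t *)a;
        uint64_t y = *(const uint64_t *)b;

        return x < y ? -1 : (x > y ? 1 : 0);
}

/*
 * Keep each object ID only once so its size is not added twice to the
 * per-node requirement.  The patch does this with a lookup before
 * rb_insert(); a sorted copy is used here only for brevity.
 * Returns the number of unique IDs written to 'out', or -1 on error.
 */
static int unique_oids(const uint64_t *oids, int nr_oids, uint64_t *out)
{
        uint64_t *copy;
        int nr_unique = 0;

        if (nr_oids <= 0)
                return 0;

        copy = malloc(sizeof(*copy) * nr_oids);
        if (!copy)
                return -1;
        memcpy(copy, oids, sizeof(*copy) * nr_oids);
        qsort(copy, nr_oids, sizeof(*copy), oid_cmp);

        for (int j = 0; j < nr_oids; j++) {
                if (j > 0 && copy[j] == copy[j - 1])
                        continue;       /* already seen: do not count again */
                out[nr_unique++] = copy[j];
        }

        free(copy);
        return nr_unique;
}
```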

tmenjo commented Apr 25, 2016

LGTM. I'll ask Matsumura, my colleague, to test this.

* TODO: current calculation doesn't consider
* about space consumption by metadata objects
* e.g. inode, ledger
*/

A reviewer commented:

I think it's better to use get_objsize() to accumulate required_space_per_node[node_idx] when is_data_obj(oids[j]) is false, i.e. to count non-data objects as well.

https://github.com/sheepdog/sheepdog/blob/master/include/sheepdog_proto.h#L532

mitake (the PR author) replied:

Non-data objects are sparse, and get_objsize() cannot account for that. Their actual size cannot be computed without fetching them, so currently I'm simply ignoring them.

The reviewer replied:

I see. If there are many deleted VDIs (i.e. sparse inode objects), required_space_per_node[node_idx] would be overestimated, and fetching them is costly. Focusing on data objects seems like a moderate and good approach.

tmenjo commented Apr 26, 2016

I think it is better to document when the new feature is, or is NOT, effective (or maybe harmful), because this patch does not count:

  1. the size of non-data objects such as inodes;
  2. temporary disk space needed during recovery for storing objects that are already replicated enough (i.e. satisfying --copies) but move between existing nodes because of vnodes recalculation.

Either of these could lead to an unexpected disk full, and counting them in code seems difficult or costly, so documentation would be helpful.

I think both of the conditions below should be satisfied to use the new feature properly:

  1. Disk usage of non-data objects is small enough, relative to data objects, to be ignored.
  2. The vnodes of each node do not change after a node joins or leaves. This can be achieved in either of two ways:
    • give every sheep process a disk of the same capacity, or
    • use the --fixedvnodes feature.

May I have your opinion, @mitake?

mitake commented Apr 27, 2016

@tmenjo I agree; it is still difficult to fully account for the disk space consumed by the recovery process. Could you add the documentation to the wiki or to files under doc/? I also agree that the two conditions you mentioned are required for this feature. I believe the automatic vnode calculation is not useful at all; every user should use --fixedvnodes mode. But a situation with unbalanced disk capacities cannot be avoided during a hardware update.

matsu777 commented Apr 27, 2016

I verified this PR with the modified patch.
It seems to work well under the conditions of this verification.

Here is the procedure I followed.
I made three 1 GB loopback devices for three sheep nodes.


# sheep --port 7000 --log dir=/root/sheeptest/log/sheep1,level=debug --cluster zookeeper:localhost:2181 /root/sheeptest/meta/sheep1,/mnt/sheepdisk1/obj -z 1
# sheep --port 7001 --log dir=/root/sheeptest/log/sheep2,level=debug --cluster zookeeper:localhost:2181 /root/sheeptest/meta/sheep2,/mnt/sheepdisk2/obj -z 2
# sheep --port 7002 --log dir=/root/sheeptest/log/sheep3,level=debug --cluster zookeeper:localhost:2181 /root/sheeptest/meta/sheep3,/mnt/sheepdisk3/obj -z 3

# dog cluster format --copies=2 -F
using backend plain store

# dog vdi create testvdi1 1G
# dog vdi create testvdi2 0.4G

# dog node info
Id Size Used Avail Use%
0 982 MB 0.0 MB 982 MB 0%
1 982 MB 0.0 MB 982 MB 0%
2 982 MB 0.0 MB 982 MB 0%
Total 2.9 GB 0.0 MB 2.9 GB 0%

Total virtual image size 1.4 GB

# dog vdi write testvdi1 < /root/sheeptest/data/test0.9G.raw
# dog vdi write testvdi2 < /root/sheeptest/data/test0.3G.raw

# dog cluster info -v
Cluster status: running, auto-recovery enabled
Cluster store: plain with 2 redundancy policy
Cluster vnodes strategy: auto
Cluster vnode mode: node
Cluster created at Tue Apr 26 18:46:49 2016

Epoch Time Version [Host:Port:V-Nodes,,,]
2016-04-26 18:46:49 1 [10.36.4.8:7000:128, 10.36.4.8:7001:128, 10.36.4.8:7002:128]

# dog node info
Id Size Used Avail Use%
0 982 MB 848 MB 134 MB 86%
1 982 MB 728 MB 254 MB 74%
2 982 MB 840 MB 142 MB 85%
Total 2.9 GB 2.4 GB 529 MB 82%

Total virtual image size 1.4 GB

# dog node kill 2

# dog cluster info -v
Cluster status: running, auto-recovery enabled
Cluster store: plain with 2 redundancy policy
Cluster vnodes strategy: auto
Cluster vnode mode: node
Cluster created at Tue Apr 26 18:46:49 2016

Epoch Time Version [Host:Port:V-Nodes,,,]
2016-04-26 18:48:14 2 [10.36.4.8:7000:128, 10.36.4.8:7001:128]
2016-04-26 18:46:49 1 [10.36.4.8:7000:128, 10.36.4.8:7001:128, 10.36.4.8:7002:128]

# dog node info
Id Size Used Avail Use%
0 982 MB 848 MB 134 MB 86%
1 982 MB 728 MB 254 MB 74%
Total 1.9 GB 1.5 GB 387 MB 80%


sheep.log


Apr 26 18:48:14 EMERG [rw 3218] check_diskfull_possibility(1247) node IPv4 ip:10.36.4.8 port:7000 will cause disk full, stopping whole cluster
Apr 26 18:48:14 DEBUG [rw 3218] check_diskfull_possibility(1254) node IPv4 ip:10.36.4.8 port:7000 (space: 1029439488) can store required space during next recovery (1266679808)
Apr 26 18:48:14 EMERG [rw 3218] check_diskfull_possibility(1247) node IPv4 ip:10.36.4.8 port:7001 will cause disk full, stopping whole cluster
Apr 26 18:48:14 DEBUG [rw 3218] check_diskfull_possibility(1254) node IPv4 ip:10.36.4.8 port:7001 (space: 1029439488) can store required space during next recovery (1266679808)
Apr 26 18:48:14 EMERG [rw 3218] prepare_object_list(1291) canceling recovery because of disk full
Apr 26 18:48:14 EMERG [rw 3218] prepare_object_list(1292) please add a new node ASAP
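
For context on the log above: the check compares, for each node, the space the next recovery would require against the space currently available, and recovery is canceled when any node could hit disk full. Below is a minimal sketch of that comparison with assumed struct and function names; it is not sheepdog's actual check_diskfull_possibility():

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed shape of the per-node numbers seen in the log above. */
struct node_space {
        const char *addr;       /* e.g. "10.36.4.8:7000" */
        uint64_t avail;         /* bytes currently free on the node */
        uint64_t required;      /* bytes the next recovery would need */
};

/*
 * Return true when any node could run out of space during the next
 * recovery; the caller then cancels recovery instead of filling disks.
 */
static bool diskfull_possible(const struct node_space *nodes, int nr_nodes)
{
        bool possible = false;

        for (int i = 0; i < nr_nodes; i++) {
                if (nodes[i].avail < nodes[i].required) {
                        fprintf(stderr,
                                "node %s may hit disk full during recovery "
                                "(avail: %" PRIu64 ", required: %" PRIu64 ")\n",
                                nodes[i].addr, nodes[i].avail,
                                nodes[i].required);
                        possible = true;
                }
        }

        return possible;
}
```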


mitake commented Apr 27, 2016

@taiyo33 thanks for testing, I'm merging this PR.

mitake merged commit 8c888c0 into master Apr 27, 2016
mitake deleted the recovery-diskfull branch on April 27, 2016 02:46
tmenjo commented Apr 28, 2016

@mitake Sure. I'll open another issue about writing the documentation.
