
sheep: avoid diskfull caused by recovery process #185

Merged: 1 commit merged into master on Apr 27, 2016

Conversation

mitake commented Oct 28, 2015

sheep can corrupt its cluster when the recovery process causes disk
full. To avoid this problem, this patch adds a new option -F to sheep.
If this option is passed to the sheep process, every sheep process in
the cluster stops itself if there is a possibility of disk full during
recovery.

Fixes #59

Signed-off-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>

mitake commented Oct 28, 2015

TODO: another mode that simply skips recovery and keeps the cluster running. However, it would allow a low-redundancy situation and is a little bit dangerous.

@vtolstov

@mitake thanks, but I prefer the second suggestion, keeping the cluster running, because the node may already have data and provide redundancy for some VDIs.

mitake commented Oct 28, 2015

@vtolstov sure, we'll provide the option later

sirio81 commented Nov 3, 2015

2015-10-28 6:51 GMT+01:00 Hitoshi Mitake notifications@github.com:

> If this command is passed to the sheep process, every sheep process of
> the cluster stops itself if there is a possibility of diskfull during
> recovery.

Hi, thank you for the patch, but I wonder: if the cluster stops (during
recovery), what do I do next? I mean, what should I do to get the cluster
back up and running?

mitake commented Nov 5, 2015

@sirio81 on second thought, a shutdown -> reboot will cause the same disk-full problem. I'll update this patch to just skip recovery instead. Thanks a lot for your comment.

mitake commented Nov 24, 2015

@vtolstov @sirio81 I updated this PR to just skip recovery instead of stopping the sheep processes. Could you try it?

@vtolstov

I'll try it, but why not do this by default (avoid disk full and not shut down sheep)?

mitake commented Dec 23, 2015

The new behavior of avoiding dangerous recovery must be shared by all sheep processes in a cluster, so I moved the new -F option to the dog cluster format command.

sirio81 commented Dec 23, 2015

2015-12-23 13:45 GMT+01:00 Hitoshi Mitake notifications@github.com:

> The new behavior of avoiding dangerous recovery must be shared by all
> sheeps in a cluster. So I moved the new -F option to dog cluster format
> command.

It's not on master yet, right?

mitake commented Dec 23, 2015

@sirio81 yes, this PR is waiting for testing.

mitake added this to the v1.0 milestone Feb 29, 2016
mitake self-assigned this Feb 29, 2016
* about space consumption by metadata objects
* e.g. inode, ledger
*/
if (is_data_obj(oids[j]))

tmenjo commented Apr 25, 2016

I think this should be if (!is_data_obj(oids[j])).
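
For reference, a minimal sketch of the per-node space estimation with the corrected condition, i.e. skipping non-data objects and counting data objects at a fixed size. The helper, the 4 MB constant, and the loop structure below are simplified stand-ins for illustration, not the code in this patch:

```c
#include <stdbool.h>
#include <stdint.h>

#define DATA_OBJ_SIZE (UINT64_C(1) << 22)       /* assumed fixed 4 MB data object */

static bool is_data_obj_stub(uint64_t oid)
{
        /* stand-in: the real sheepdog helper inspects the object-type bits */
        return (oid >> 62) == 0;
}

/* Sum the space the data objects in 'oids' would need on one node. */
static uint64_t required_space_for_node(const uint64_t *oids, int nr_oids)
{
        uint64_t required = 0;

        for (int j = 0; j < nr_oids; j++) {
                /*
                 * The corrected condition: skip metadata objects
                 * (inode, ledger, ...) and count only data objects.
                 */
                if (!is_data_obj_stub(oids[j]))
                        continue;

                required += DATA_OBJ_SIZE;
        }

        return required;
}
```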

sheep can corrupt its cluster when the recovery process causes disk
full. To avoid this problem, this patch adds a new option -F to dog
cluster format. If this option is passed during cluster formatting,
every sheep process of the cluster skips recovery if there is a
possibility of disk full during recovery.

Fixes #59

Signed-off-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
mitake commented Apr 25, 2016

@tmenjo thanks for your review. I fixed the invalid condition in the object size calculation and the double counting. Could you review it (not tested yet...)?

free(key);
continue;
}
rb_insert(&seen_objects, key, node, seen_object_cmp);
LGTM 👍
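
For readers unfamiliar with the double-counting fix mentioned above: the patch records object IDs it has already accounted for (the seen_objects tree populated with rb_insert in the excerpt) and skips duplicates. Below is a self-contained sketch of the same idea, using a sorted copy of the ID array instead of an rb-tree purely to keep it short; the names are illustrative, not the patch's:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int oid_cmp(const void *a, const void *b)
{
        uint64_t x = *(const uint64_t *)a;
        uint64_t y = *(const uint64_t *)b;

        return x < y ? -1 : (x > y ? 1 : 0);
}

/*
 * Keep each object ID only once so its size is not added twice to the
 * per-node requirement.  The patch does this with a lookup before
 * rb_insert(); a sorted copy is used here only for brevity.
 * Returns the number of unique IDs written to 'out', or -1 on error.
 */
static int unique_oids(const uint64_t *oids, int nr_oids, uint64_t *out)
{
        uint64_t *copy;
        int nr_unique = 0;

        if (nr_oids <= 0)
                return 0;

        copy = malloc(sizeof(*copy) * nr_oids);
        if (!copy)
                return -1;
        memcpy(copy, oids, sizeof(*copy) * nr_oids);
        qsort(copy, nr_oids, sizeof(*copy), oid_cmp);

        for (int j = 0; j < nr_oids; j++) {
                if (j > 0 && copy[j] == copy[j - 1])
                        continue;       /* already seen: do not count again */
                out[nr_unique++] = copy[j];
        }

        free(copy);
        return nr_unique;
}
```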

tmenjo commented Apr 25, 2016

LGTM. I'll ask Matsumura, my colleague, to test this.

* TODO: current calculation doesn't consider
* about space consumption by metadata objects
* e.g. inode, ledger
*/

A reviewer commented:

I think it's better to use get_objsize() to accumulate required_space_per_node[node_idx] when is_data_obj(oids[j]) is false, i.e. to count non-data objects as well.

https://github.com/sheepdog/sheepdog/blob/master/include/sheepdog_proto.h#L532

mitake (the PR author) replied:

Non-data objects are sparse, and get_objsize() cannot account for that. Their actual size cannot be computed without fetching them, so currently I'm simply ignoring them.

The reviewer replied:

I see. If there are many deleted VDIs (i.e. sparse inode objects), required_space_per_node[node_idx] would be overestimated, and fetching them is costly. Focusing on data objects seems like a moderate and good approach.

tmenjo commented Apr 26, 2016

I think it is better to document when the new feature is, or is NOT, effective (or maybe harmful), because this patch does not count:

  1. the size of non-data objects such as inodes;
  2. temporary disk space needed during recovery for storing objects that are already replicated enough (i.e. satisfying --copies) but move between existing nodes because of vnodes recalculation.

Either of these could lead to an unexpected disk full, and counting them in code seems difficult or costly, so documentation would be helpful.

I think both of the conditions below should be satisfied to use the new feature properly:

  1. Disk usage of non-data objects is small enough, relative to data objects, to be ignored.
  2. The vnodes of each node do not change after a node joins or leaves. This can be achieved in either of two ways:
    • give every sheep process a disk of the same capacity, or
    • use the --fixedvnodes feature.

May I have your opinion, @mitake?

mitake commented Apr 27, 2016

@tmenjo I agree; it is still difficult to fully account for the disk space consumed by the recovery process. Could you add the documentation to the wiki or to files under doc/? I also agree that the two conditions you mentioned are required for this feature. I believe the automatic vnode calculation is not useful at all; every user should use --fixedvnodes mode. But a situation with unbalanced disk capacities cannot be avoided during a hardware update.

matsu777 commented Apr 27, 2016

I verified this PR with the modified patch.
It seems to work well under the conditions of this verification.

Here is the procedure I followed.
I made three 1 GB loopback devices for three sheep nodes.


# sheep --port 7000 --log dir=/root/sheeptest/log/sheep1,level=debug --cluster zookeeper:localhost:2181 /root/sheeptest/meta/sheep1,/mnt/sheepdisk1/obj -z 1
# sheep --port 7001 --log dir=/root/sheeptest/log/sheep2,level=debug --cluster zookeeper:localhost:2181 /root/sheeptest/meta/sheep2,/mnt/sheepdisk2/obj -z 2
# sheep --port 7002 --log dir=/root/sheeptest/log/sheep3,level=debug --cluster zookeeper:localhost:2181 /root/sheeptest/meta/sheep3,/mnt/sheepdisk3/obj -z 3

# dog cluster format --copies=2 -F
using backend plain store

# dog vdi create testvdi1 1G
# dog vdi create testvdi2 0.4G

# dog node info
Id Size Used Avail Use%
0 982 MB 0.0 MB 982 MB 0%
1 982 MB 0.0 MB 982 MB 0%
2 982 MB 0.0 MB 982 MB 0%
Total 2.9 GB 0.0 MB 2.9 GB 0%

Total virtual image size 1.4 GB

# dog vdi write testvdi1 < /root/sheeptest/data/test0.9G.raw
# dog vdi write testvdi2 < /root/sheeptest/data/test0.3G.raw

# dog cluster info -v
Cluster status: running, auto-recovery enabled
Cluster store: plain with 2 redundancy policy
Cluster vnodes strategy: auto
Cluster vnode mode: node
Cluster created at Tue Apr 26 18:46:49 2016

Epoch Time Version [Host:Port:V-Nodes,,,]
2016-04-26 18:46:49 1 [10.36.4.8:7000:128, 10.36.4.8:7001:128, 10.36.4.8:7002:128]

# dog node info
Id Size Used Avail Use%
0 982 MB 848 MB 134 MB 86%
1 982 MB 728 MB 254 MB 74%
2 982 MB 840 MB 142 MB 85%
Total 2.9 GB 2.4 GB 529 MB 82%

Total virtual image size 1.4 GB

# dog node kill 2

# dog cluster info -v
Cluster status: running, auto-recovery enabled
Cluster store: plain with 2 redundancy policy
Cluster vnodes strategy: auto
Cluster vnode mode: node
Cluster created at Tue Apr 26 18:46:49 2016

Epoch Time Version [Host:Port:V-Nodes,,,]
2016-04-26 18:48:14 2 [10.36.4.8:7000:128, 10.36.4.8:7001:128]
2016-04-26 18:46:49 1 [10.36.4.8:7000:128, 10.36.4.8:7001:128, 10.36.4.8:7002:128]

# dog node info
Id Size Used Avail Use%
0 982 MB 848 MB 134 MB 86%
1 982 MB 728 MB 254 MB 74%
Total 1.9 GB 1.5 GB 387 MB 80%


sheep.log


Apr 26 18:48:14 EMERG [rw 3218] check_diskfull_possibility(1247) node IPv4 ip:10.36.4.8 port:7000 will cause disk full, stopping whole cluster
Apr 26 18:48:14 DEBUG [rw 3218] check_diskfull_possibility(1254) node IPv4 ip:10.36.4.8 port:7000 (space: 1029439488) can store required space during next recovery (1266679808)
Apr 26 18:48:14 EMERG [rw 3218] check_diskfull_possibility(1247) node IPv4 ip:10.36.4.8 port:7001 will cause disk full, stopping whole cluster
Apr 26 18:48:14 DEBUG [rw 3218] check_diskfull_possibility(1254) node IPv4 ip:10.36.4.8 port:7001 (space: 1029439488) can store required space during next recovery (1266679808)
Apr 26 18:48:14 EMERG [rw 3218] prepare_object_list(1291) canceling recovery because of disk full
Apr 26 18:48:14 EMERG [rw 3218] prepare_object_list(1292) please add a new node ASAP
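
For context on the log above: the check compares, for each node, the space the next recovery would require against the space currently available, and recovery is canceled when any node could hit disk full. Below is a minimal sketch of that comparison with assumed struct and function names; it is not sheepdog's actual check_diskfull_possibility():

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed shape of the per-node numbers seen in the log above. */
struct node_space {
        const char *addr;       /* e.g. "10.36.4.8:7000" */
        uint64_t avail;         /* bytes currently free on the node */
        uint64_t required;      /* bytes the next recovery would need */
};

/*
 * Return true when any node could run out of space during the next
 * recovery; the caller then cancels recovery instead of filling disks.
 */
static bool diskfull_possible(const struct node_space *nodes, int nr_nodes)
{
        bool possible = false;

        for (int i = 0; i < nr_nodes; i++) {
                if (nodes[i].avail < nodes[i].required) {
                        fprintf(stderr,
                                "node %s may hit disk full during recovery "
                                "(avail: %" PRIu64 ", required: %" PRIu64 ")\n",
                                nodes[i].addr, nodes[i].avail,
                                nodes[i].required);
                        possible = true;
                }
        }

        return possible;
}
```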


mitake commented Apr 27, 2016

@taiyo33 thanks for testing, I'm merging this PR.

mitake merged commit 8c888c0 into master Apr 27, 2016
mitake deleted the recovery-diskfull branch on April 27, 2016 02:46
tmenjo commented Apr 28, 2016

@mitake Sure. I'll open another issue about writing the documentation.
