
Disable recovery when there's not enough space #59

Closed
mitake opened this issue Aug 1, 2015 · 6 comments
mitake commented Aug 1, 2015

Here is a simple example: a cluster with 3 nodes and --copies 2.
All nodes are about 80-90% full.
When I kill a node, the cluster tries to replicate the missing copies from the lost node, but there is obviously not enough space.
I think sheepdog should behave like this:

  • as soon as there is not enough space on the cluster to replicate the loss of any of the nodes, recovery has to be disabled.
  • if a node dies, the cluster is still able to work, but it has to show a 'degraded' state in dog cluster info.
    (This is similar to mdadm showing 'clean,degraded' when a disk is missing.)

dog node info
Id Size Used Avail Use%
0 4.6 GB 4.1 GB 479 MB 89%
1 5.0 GB 3.8 GB 1.1 GB 77%
2 5.0 GB 4.1 GB 894 MB 82%
Total 15 GB 12 GB 2.5 GB 83%

df -h /mnt/sheep/0
/dev/sda6 4,7G 4,2G 479M 90% /mnt/sheep/0

dog cluster info
Cluster status: running, auto-recovery enabled
Cluster created at Sat Oct 4 10:34:30 2014
Epoch Time Version
2014-10-04 10:34:30 1 [192.168.10.4:7000, 192.168.10.5:7000, 192.168.10.6:7000]
root@test004:~# dog cluster info -v
Cluster status: running, auto-recovery enabled
Cluster store: plain with 2 redundancy policy
Cluster vnode mode: node
Cluster created at Sat Oct 4 10:34:30 2014

dog node kill 2

dog node info
Id Size Used Avail Use%
0 4.6 GB 4.6 GB 2.7 MB 99%
1 5.0 GB 5.0 GB 1.5 MB 99%
Total 9.6 GB 9.6 GB 4.2 MB 99%

/var/lib/sheepdog/sheep.log
Oct 04 10:37:39 ERROR [rw 4593] prealloc(385) failed to preallocate space, No space left on device
Oct 04 10:37:39 ERROR [rw 4593] err_to_sderr(108) diskfull, oid=fd38150000005b
Oct 04 10:37:39 ALERT [rw 4593] recover_replication_object(404) cannot access any replicas of fd38150000005b at epoch 1
Oct 04 10:37:39 ALERT [rw 4593] recover_replication_object(405) clients may see old data
Oct 04 10:37:39 ERROR [rw 4593] recover_replication_object(412) can not recover oid fd38150000005b
Oct 04 10:37:39 ERROR [rw 4593] recover_object_work(576) failed to recover object fd38150000005b

dog vdi check
Server has no space for new objects

Sheepdog daemon version 0.8.0_353_g4d282d3

@mitake mitake added the sheep label Aug 1, 2015
@mitake mitake self-assigned this Aug 1, 2015

gadago commented Sep 22, 2015

Hi,

I wondered whether there has been any progress on this? We see the same issue here on a 3-node cluster.

Thanks,


mitake commented Sep 22, 2015

Sorry, I'm not working on this now. I'll solve this ASAP.


gadago commented Sep 22, 2015

No problem :)

We are looking at using sheepdog for a project and this was one of the things we noticed could be an issue in our testing.

Let me know when you have implemented the fix, and I'd be happy to help you test it :)


mitake commented Sep 22, 2015

Thanks a lot for your help!


gadago commented Oct 26, 2015

Hi,

I just wondered if any progress has been made on this?


mitake commented Oct 28, 2015

Hi @gadago, sorry for my late reply.

I created a branch for this problem: https://github.com/sheepdog/sheepdog/tree/recovery-diskfull

Could you check it? If you pass the new -F option to sheep, the cluster will stop itself when the recovery process could cause a disk-full condition.

cc @sirio81 @atw-abe

mitake added a commit that referenced this issue Oct 28, 2015
sheep can corrupt its cluster when the recovery process fills a disk. To
avoid this problem, this patch adds a new option -F to sheep. If this
option is passed to the sheep process, every sheep process in the
cluster stops itself when there is a possibility of a disk filling up
during recovery.

Fixes #59

Signed-off-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
mitake added a commit that referenced this issue Nov 24, 2015
sheep can corrupt its cluster when the recovery process fills a disk. To
avoid this problem, this patch adds a new option -F to sheep. If this
option is passed to the sheep process, every sheep process in the
cluster skips recovery when there is a possibility of a disk filling up
during recovery.

Fixes #59

Signed-off-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
mitake added a commit that referenced this issue Dec 23, 2015
sheep can corrupt its cluster when the recovery process fills a disk. To
avoid this problem, this patch adds a new option -F to dog cluster
format. If this option is passed during cluster formatting, every
sheep process in the cluster skips recovery when there is a possibility
of a disk filling up during recovery.

Fixes #59

Signed-off-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
tmenjo pushed a commit to tmenjo/sheepdog that referenced this issue Apr 22, 2016
sheep can corrupt its cluster when the recovery process fills a disk. To
avoid this problem, this patch adds a new option -F to dog cluster
format. If this option is passed during cluster formatting, every
sheep process in the cluster skips recovery when there is a possibility
of a disk filling up during recovery.

Fixes sheepdog#59

Signed-off-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
mitake added a commit that referenced this issue Apr 25, 2016
sheep can corrupt its cluster when the recovery process fills a disk. To
avoid this problem, this patch adds a new option -F to dog cluster
format. If this option is passed during cluster formatting, every
sheep process in the cluster skips recovery when there is a possibility
of a disk filling up during recovery.

Fixes #59

Signed-off-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>