data loss with --no-direct-io #1703

wojons · 2013-11-25T21:30:00Z

Lost power a few hours after creating a small test table with maybe 5 rows in it. When I restarted the server the table was empty. I MAY have lost 100k records from a different table. I'm not sure how often fsync is called with --no-direct-io; if it just waits for Linux to force an fsync, that could be the issue. I will update the ticket with the mount output when I get back to my workstation, but it should just be default ext4 with the normal Linux Mint (Ubuntu) defaults.

danielmewes · 2013-11-25T21:33:51Z

As I understand, our syncing scheme doesn't rely on direct i/o at all. @srh: Can you confirm that?
So --no-direct-io should behave the same in this respect as the default direct i/o.

I wonder if the table could appear empty if the file for the namespace in the rethinkdb data directory got lost?

coffeemug · 2013-11-25T22:34:30Z

Could someone ping @srh in person about this when you get the chance? (he tends to read github issues less frequently when he's in the middle of a big project)

srh · 2013-11-26T23:43:25Z

The --no-direct-io option is indeed independent of syncing. Syncing writes is affected by hard durability / soft durability options and noreply options. fsync is called in any case that a write to disk happens -- what the hard/soft/noreply options affect is how frequently we write to disk.
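To make that distinction concrete, here is a toy sketch in Python (not RethinkDB's actual code). It assumes a simplified model in which every disk write is followed by fsync, and the durability setting only decides whether a write is pushed to disk before it is acknowledged. The class `ToyLog` and its methods are hypothetical names for illustration.

```python
import os
import tempfile


class ToyLog:
    """Toy append-only log. Every write to disk is followed by fsync."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.pending = []  # records accepted but not yet written to disk

    def flush(self):
        """Write all buffered records to disk, then fsync."""
        if not self.pending:
            return
        view = memoryview(b"".join(self.pending))
        while len(view) > 0:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(self.fd, view)
            view = view[written:]
        os.fsync(self.fd)
        self.pending.clear()

    def append(self, record, durability="hard"):
        """Accept a record; with "hard" durability, flush before returning."""
        self.pending.append(record)
        if durability == "hard":
            self.flush()
        # With "soft" durability the record stays in memory until a later
        # flush, so a power failure before that flush loses it.

    def close(self):
        self.flush()
        os.close(self.fd)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "toy.log")
    log = ToyLog(path)
    log.append(b"soft write\n", durability="soft")
    print(os.path.getsize(path))  # nothing on disk yet
    log.append(b"hard write\n", durability="hard")
    print(os.path.getsize(path))  # both records are now on disk
    log.close()
```

In this model neither setting skips fsync; soft durability just widens the window in which an acknowledged write exists only in memory.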

coffeemug · 2013-11-30T11:08:58Z

I don't think there is enough actionable data for us here. Is everyone ok with me moving this issue into backlog so we can do further testing when time permits?

mlucy · 2013-11-30T19:19:21Z

I feel like we should create a new issue for writing automated power-failure tests (maybe by kill -9ing a VM running RethinkDB?). We might be able to reproduce that way.

srh · 2013-12-03T19:54:03Z

The problem could be that we call fdatasync, but we don't call fsync in such a way that makes sure the file's actually present in the directory. Then the file doesn't exist upon startup after the power failure, but the metadata says the table exists, and the server (perhaps) silently creates the table when it can't find a file for it.

danielmewes · 2013-12-03T20:00:22Z

It seems we would also have to call fsync on the directory in which we create rethinkdb_data in the case of rethinkdb create.
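For readers unfamiliar with the pattern being discussed, here is a minimal sketch in Python (RethinkDB itself is C++, and this is not the code from the fix). It shows the general POSIX technique: after creating a file or directory, fsync the parent directory so the new directory entry survives a crash. The helper names `fsync_dir`, `durable_create` and `durable_mkdir`, and the file names in the usage example, are hypothetical.

```python
import os
import tempfile


def fsync_dir(path):
    """fsync a directory so entries created in it reach stable storage."""
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_DIRECTORY", 0))
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_create(path, data):
    """Create a new file with `data`, sync it, then sync its parent directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # file contents and file metadata
    finally:
        os.close(fd)
    # Without this step the file's data may be on disk while the directory
    # entry pointing at it is not, so the file can vanish after a power loss.
    fsync_dir(os.path.dirname(os.path.abspath(path)))


def durable_mkdir(path):
    """Create a directory and sync its parent so the new entry is durable."""
    os.mkdir(path)
    fsync_dir(os.path.dirname(os.path.abspath(path)))


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    data_dir = os.path.join(base, "example_data")  # hypothetical name
    durable_mkdir(data_dir)
    durable_create(os.path.join(data_dir, "example_table"), b"row data")
```

The directory sync is what covers the case described above, where a freshly created data directory or table file would otherwise have no durable entry in its parent.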

wojons · 2013-12-04T11:33:21Z

@mlucy I think that would be a great idea. Running in the cloud these days, you don't know how and when your server will be stopped, or whether there will be some sort of automatic migration mid-process. I would also find it useful to see how it handles other types of failures, like random parts of memory falling out of sync, or simulated kernel panics, and so on.

@danielmewes and @srh sounds like you guys have figured out the problem

danielmewes · 2013-12-04T19:09:30Z

@srh: Ok if I take this?

danielmewes · 2013-12-04T21:26:23Z

A fix for the possible cause of this is in code review 1070 by @srh.

danielmewes · 2013-12-05T02:33:59Z

The fix has been merged into next as of 5ed5667 and cherry-picked into v1.11.x as of 85ef29d.

danielmewes · 2013-12-05T02:34:53Z

@wojons: The fix will be included in the next release of RethinkDB, whether it is a point release (1.11.2) or a major one (1.12).

wojons · 2013-12-05T21:27:50Z

thanks @danielmewes

AtnNn · 2013-12-06T23:04:57Z

The fix has been released in RethinkDB 1.11.2

ghost assigned danielmewes Dec 4, 2013

danielmewes closed this as completed Dec 5, 2013
