Raiden must quit when disk is full #675

Closed
2 tasks
hackaugusto opened this issue Jun 22, 2017 · 7 comments

Comments

@hackaugusto
Contributor

hackaugusto commented Jun 22, 2017

Problem Definition

To guarantee resilience, all WAL write operations must succeed. If the log backend (e.g. sqlite) fails under certain circumstances (out-of-memory, disk-full, etc.), the process cannot make progress in any form; if the node tries to progress anyway, it will lose important data. To avoid further errors, the node must quit immediately.

Solution

Treat SQLite exceptions as final and quit the process.

Tasklist

  • Write a test for a full disk scenario.
  • Add SQLite exceptions to the gevent hub's SYSTEM_ERROR.
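
A minimal sketch of the "treat SQLite exceptions as final" idea, using only the standard library. It does not touch the gevent hub; instead it swaps in a plain wrapper around each write. The names `write_or_quit`, `FATAL_EXIT_CODE`, and the `state_changes` table are hypothetical and only serve the illustration.

```python
import sqlite3
import sys

FATAL_EXIT_CODE = 1  # hypothetical exit code chosen for this sketch


def write_or_quit(conn, sql, params=()):
    """Run one log write; on any SQLite error, stop the process."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
    except sqlite3.Error as exc:
        # Do not try to recover: continuing could lose state.
        print(f"fatal: log write failed: {exc!r}", file=sys.stderr)
        raise SystemExit(FATAL_EXIT_CODE)


# Hypothetical in-memory log used only to exercise the wrapper.
conn = sqlite3.connect(":memory:")
write_or_quit(conn, "CREATE TABLE state_changes (id INTEGER PRIMARY KEY, data BLOB)")
write_or_quit(conn, "INSERT INTO state_changes (data) VALUES (?)", (b"payload",))
```

Raising `SystemExit` rather than returning an error code keeps callers from accidentally carrying on after a failed write.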
@hackaugusto
Contributor Author

hackaugusto commented Jun 22, 2017

Ideally we could pre-commit disk space and memory for safe operation. Edit: Actually, thinking about it, it is simpler to just do what the issue says and handle the sqlite exceptions.

@carllin
Contributor

carllin commented Apr 26, 2018

I think I can do this :)

@LefterisJP
Contributor

LefterisJP commented Apr 27, 2018

@carllin sure go ahead.

@karlb
Contributor

karlb commented Nov 15, 2018

I managed to provoke an sqlite3.OperationalError by temporarily limiting the process's allowed file size via resource.setrlimit.
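
Roughly, the experiment can be sketched like this (POSIX only). The function name, table name, and the 64 KiB limit are made up for the example. It relies on CPython ignoring SIGXFSZ at startup, so an oversized write fails with an error instead of killing the process.

```python
import os
import resource
import sqlite3
import tempfile


def provoke_sqlite_write_error(limit_bytes=64 * 1024):
    """Return the sqlite3 error raised once the file size limit is hit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "log.db"))
        conn.execute("CREATE TABLE log (data BLOB)")
        conn.commit()
        # Lower only the soft limit so it can be restored afterwards.
        resource.setrlimit(resource.RLIMIT_FSIZE, (limit_bytes, hard))
        try:
            # 1000 * 4 KiB is far more than the limit allows.
            for _ in range(1000):
                conn.execute("INSERT INTO log VALUES (?)", (b"x" * 4096,))
                conn.commit()
        except sqlite3.OperationalError as exc:
            return exc
        finally:
            resource.setrlimit(resource.RLIMIT_FSIZE, (soft, hard))
            conn.close()
    return None


error = provoke_sqlite_write_error()
print(type(error).__name__, error)
```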

Does anyone have an idea how the real test might look? It seems like I need to run a complete MatrixRunner or UDPRunner to test the real error handling part, but that would hardly be a unit test anymore.

I also think that we already handle the case correctly with the general error handler at https://github.com/raiden-network/raiden/blob/master/raiden/ui/runners.py#L186 . Or is there anything else we want to do in that case? sqlite3.OperationalError: disk I/O error is reasonably descriptive for a rather rare error condition.

@hackaugusto
Contributor Author

hackaugusto commented Nov 15, 2018

It seems like I need to run a complete MatrixRunner or UDPRunner to test the real error handling part, but that would hardly be a unit test anymore.

I don't think it's possible to write this as a unit test, since the requirement is to exit the application.

Does anyone have an idea how the real test might look?

What about a fork + setrlimit + spawn?
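
As a rough sketch of that idea (POSIX only): `subprocess` with `preexec_fn` runs a hook in the forked child before the new program is executed, which is where the limit can be set. The child script below is a stand-in for the real application, and the exit code 3 and all names are invented for this example.

```python
import resource
import subprocess
import sys
import tempfile

# Stand-in for the real node: writes to SQLite until a write fails, then quits.
CHILD_SCRIPT = r"""
import sqlite3, sys
conn = sqlite3.connect(sys.argv[1])
conn.execute("CREATE TABLE log (data BLOB)")
try:
    for _ in range(1000):
        conn.execute("INSERT INTO log VALUES (?)", (b"x" * 4096,))
        conn.commit()
except sqlite3.Error:
    sys.exit(3)  # hypothetical "storage failure" exit code
"""


def limit_file_size():
    # Runs in the forked child, before the new interpreter is executed.
    _, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    resource.setrlimit(resource.RLIMIT_FSIZE, (64 * 1024, hard))


def run_child_with_small_file_limit():
    with tempfile.TemporaryDirectory() as tmp:
        proc = subprocess.run(
            [sys.executable, "-c", CHILD_SCRIPT, f"{tmp}/log.db"],
            preexec_fn=limit_file_size,
            capture_output=True,
            timeout=60,
        )
    return proc.returncode


print(run_child_with_small_file_limit())
```

The test then only has to assert on the child's exit code, which matches the requirement that the application exits.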

@LefterisJP
Contributor

LefterisJP commented Nov 15, 2018

sqlite3.OperationalError: disk I/O error is reasonably descriptive for a rather rare error condition.

Yeah, I guess this should be fine -- it covers not only lack of disk space but also corrupt sectors or whatnot.

@karlb
Contributor

karlb commented Nov 16, 2018

Approach 1: spawn a new raiden process via command line

Not only do we need to set up geth, keystore and the raiden service, but also the required contracts for raiden to start successfully. We'd mostly need to rebuild the whole smoke test. That's overly complicated and redundant.

Approach 2: Build on existing raiden_network fixture

This is easy to get running, but it does not invoke the real error handling from MatrixRunner, so we're not actually testing anything meaningful here.

Another approach could be to wrap the smoketest with a resource limit and use that to test this error condition. But @LefterisJP and I agreed that the benefit/effort ratio is not sufficient to go on with this.

We at least learned:

  1. that a sqlite3.OperationalError will get thrown when we're out of disk space
  2. that the current implementation should handle this the way we'd like, thanks to https://github.com/raiden-network/raiden/blob/master/raiden/ui/runners.py#L186

@karlb karlb closed this as completed Nov 16, 2018