Rhizome server process to avoid lock errors and reduce servald latency #1

Open
quixotique opened this issue Aug 22, 2012 · 0 comments

The SQLite3 library locks all database operations to stop concurrent processes from corrupting the database. See http://www.sqlite.org/lockingv3.html. The scheme allows many concurrent readers or one single writer at a time.

The current architecture of Rhizome as implemented in servald allows more than one process to directly access the Rhizome database, which can produce lock conflicts. These conflicts cause the SQLite queries to fail immediately: the sqlite3_step() function returns SQLITE_BUSY.

One immediate consequence of these lock errors is in sending/receiving MeshMS messages. An incoming MeshMS message log (Rhizome bundle) causes the Batphone app to start a thread which accesses the Rhizome database directly via calls to the servald command line operations rhizome list, rhizome extract manifest, rhizome extract file, and rhizome add file. Sometimes these operations fail because of a database lock error.

As a side issue (issue #3), these errors are not always reported back to the Java code, which continues on assuming the operation was successful. So a retry scheme cannot be implemented by the Batphone app.

The main issue is that MeshMS reception (acknowledgement) and sending should simply not be allowed to fail because of database lock errors. The architecture must be made to deal with database concurrency issues completely and correctly.

This can be dealt with in three stages:

  1. A partial fix to reduce the impact of database lock errors and keep MeshMS reliable enough for demo purposes. Issue #2 ("Recover from Rhizome database lock errors using sleep-retry") introduces a low-level sleep-retry mechanism into all database accesses that should avoid the majority of lock errors.
  2. Issue #3 ("Report Rhizome database errors to command line caller") fixes the reporting of database lock errors to the Rhizome command-line operations that invoke them, so that Batphone Java code can detect and deal with the failure.
  3. The substance of this issue is to change the existing servald architecture to fix the issue properly, as described below.
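
The sleep-retry idea in stage 1 could look roughly like the sketch below. It is not the servald code: `retry_on_busy`, `db_op_fn`, `busy_n_times` and the `DB_OK`/`DB_BUSY` codes are hypothetical names standing in for a wrapped database call and for SQLite result codes such as SQLITE_OK and SQLITE_BUSY, and the attempt count and sleep interval are arbitrary.

```c
#include <time.h>

/* Hypothetical status codes standing in for SQLite result codes such as
 * SQLITE_OK and SQLITE_BUSY; they are not taken from servald. */
enum { DB_OK = 0, DB_BUSY = 5 };

/* Hypothetical callback type: one database operation to attempt. */
typedef int (*db_op_fn)(void *ctx);

/* Run op(ctx), sleeping and retrying while it reports DB_BUSY.
 * Returns the first non-busy result, or DB_BUSY once max_attempts
 * attempts have all reported busy. */
static int retry_on_busy(db_op_fn op, void *ctx, int max_attempts, long sleep_ms)
{
    int rc = DB_BUSY;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        rc = op(ctx);
        if (rc != DB_BUSY)
            return rc;
        if (attempt + 1 < max_attempts) {
            /* Give the process holding the lock time to finish. */
            struct timespec ts = { sleep_ms / 1000, (sleep_ms % 1000) * 1000000L };
            nanosleep(&ts, NULL);
        }
    }
    return rc;
}

/* Demo operation: reports busy for the first *ctx calls, then succeeds. */
static int busy_n_times(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        --*remaining;
        return DB_BUSY;
    }
    return DB_OK;
}
```

A retry loop like this only makes lock failures less likely. If the lock is held for longer than the total retry time, the caller still sees a busy result, which is why stages 2 and 3 are still needed.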

All Rhizome database operations ought to be performed by a single Rhizome server process that should be a fork(2) of the servald process.

The Rhizome server will present a simple request-response interface to all other components of the Serval Mesh product, and all Rhizome database operations will be performed exclusively by that process, thus eliminating the risk of database lock errors under normal circumstances.

The Rhizome server can safely have very high latency if needed, and this will not affect the low-latency services offered by servald. If servald wishes to perform a Rhizome store operation, for example storing a payload that was just received via HTTP, then it puts the data into request parameters (and optionally files in external storage), and sends a request to the Rhizome server, using its asynchronous I/O mechanism to wait for the server to accept and then complete the request. All command-line rhizome operations will do the same.
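
As a rough illustration of the proposed shape, the sketch below forks a child process that is the only one allowed to touch the database and talks to it over a socket pair with fixed-size request and response records. Everything here is an assumption made for the example: the issue does not define a wire format or transport, and `rz_request`, `rz_response`, `rz_server_start`, `rz_request_send` and the `RZ_OP_*` codes are hypothetical names. It also assumes each small record is transferred by a single read() or write(), which a real implementation would not rely on.

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical wire format; the issue does not define one. */
struct rz_request  { int op; char arg[64]; };
struct rz_response { int status; char detail[64]; };

enum { RZ_OP_STORE = 1, RZ_OP_QUIT = 2 };

/* Server loop: the only place that would touch the Rhizome database.
 * Exits on RZ_OP_QUIT or when the client end is closed. */
static void rz_server_loop(int fd)
{
    struct rz_request req;
    while (read(fd, &req, sizeof req) == (ssize_t)sizeof req) {
        struct rz_response resp;
        memset(&resp, 0, sizeof resp);
        if (req.op == RZ_OP_QUIT)
            break;
        if (req.op == RZ_OP_STORE) {
            /* A real server would perform the SQLite write here.
             * This sketch just echoes the argument back. */
            req.arg[sizeof req.arg - 1] = '\0';
            strncpy(resp.detail, req.arg, sizeof resp.detail - 1);
            resp.status = 0;
        } else {
            resp.status = -1; /* unknown operation */
        }
        if (write(fd, &resp, sizeof resp) != (ssize_t)sizeof resp)
            break;
    }
}

/* Fork the server process. Returns its pid and stores the client end
 * of the socket pair in *client_fd, or returns -1 on failure. */
static pid_t rz_server_start(int *client_fd)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    pid_t pid = fork();
    if (pid == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        close(sv[0]);
        rz_server_loop(sv[1]);
        _exit(0);
    }
    close(sv[1]);
    *client_fd = sv[0];
    return pid;
}

/* Send one request, then wait with poll() for the reply.
 * Returns 0 on success, -1 on error or timeout. */
static int rz_request_send(int fd, int op, const char *arg,
                           struct rz_response *resp, int timeout_ms)
{
    struct rz_request req;
    memset(&req, 0, sizeof req);
    req.op = op;
    if (arg)
        strncpy(req.arg, arg, sizeof req.arg - 1);
    if (write(fd, &req, sizeof req) != (ssize_t)sizeof req)
        return -1;
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;
    return read(fd, resp, sizeof *resp) == (ssize_t)sizeof *resp ? 0 : -1;
}
```

In this sketch the client blocks in poll() for simplicity. To keep servald's own services low-latency, the descriptor would instead be registered with servald's existing event loop and the response handled in a callback when it arrives.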
