Vladimir Krivopalov edited this page Mar 7, 2018 · 7 revisions

Overview

The Cassandra commit log is, even in its original form, fairly simple. It consists of a number of files containing serialized mutation operations, in the order they were appended. (Note: stock Cassandra orders writes explicitly before touching the commit log; see KeySpace.writeOrder.) The serialized format is managed by the mutation itself; only metadata and the CRC are handled by the log.

Three aspects are maintained:

  1. Commit log segment size. Presumably bounded both for file management sanity and because Cassandra uses mapped byte buffers for file IO. Each writer allocates the space required for a mutation within a segment via CAS.
  2. Write order. This is what the code calls it, but in reality it is just markers ensuring that all write operations have completed before a segment part is finalized (flushed).
  3. Eventual consistency. Since writes are done via mapped IO, a background thread periodically flushes the outstanding pages to disk. Depending on the mode, appending to the commit log either waits for the flush to reach the allocated position ("batch" mode) or waits only if the last flush was too long ago.
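
The CAS-based allocation in point 1 can be sketched roughly as follows. This is an illustrative sketch, not the actual Cassandra or Scylla code; the names (`Segment`, `allocate`, the segment capacity) are assumptions.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Hypothetical sketch of a segment handing out space to concurrent
// writers via CAS, as stock Cassandra does with its mapped buffers.
struct Segment {
    static constexpr size_t capacity = 32 * 1024 * 1024; // assumed segment size
    std::atomic<size_t> tail{0};

    // Returns the offset reserved for a mutation of `n` bytes, or nullopt
    // if the segment is full and a new one must be started.
    std::optional<size_t> allocate(size_t n) {
        size_t cur = tail.load(std::memory_order_relaxed);
        do {
            if (cur + n > capacity) {
                return std::nullopt; // segment exhausted
            }
        } while (!tail.compare_exchange_weak(cur, cur + n,
                                             std::memory_order_acq_rel));
        return cur; // exclusive [cur, cur + n) region for this writer
    }
};
```

Each successful call reserves a disjoint byte range, so concurrent writers never overlap; a full segment simply signals the caller to roll over to a new one.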

There is some code complexity in the classes, but it all relates to multi-threading and the points above. Assuming we use a commit log per reactor (i.e. per CPU/thread), the above becomes fairly trivial.

Lifecycle

Commit log segments are managed by the CommitLogSegmentManager. Each segment is handed an ID number (an incremented counter), which also serves as an age counter. Allocating from (i.e. writing to) a segment returns a ReplayPosition, which is the segment ID plus the file offset. The segment in turn keeps a map from column family UUID to the highest allocated position. For each column family in a memtable, a "last replay position" is kept. Once the memtable is flushed, this position is reported to the commit log, and if this clears all dirty UUIDs (comparing UUID + position against the dirty set), the segment is deleted/recycled.
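
The bookkeeping described above can be sketched as below. This is a simplified illustration, not the real classes: the names `SegmentBookkeeping`, `mark_write`, and `mark_clean` are assumptions, and the column family UUID is reduced to an integer.

```cpp
#include <cstdint>
#include <map>
#include <tuple>

// A ReplayPosition is a (segment id, offset) pair; segment ids increase
// monotonically, so ordering by (id, offset) orders positions by age.
struct ReplayPosition {
    uint64_t segment_id;
    uint64_t offset;
    bool operator<(const ReplayPosition& o) const {
        return std::tie(segment_id, offset) < std::tie(o.segment_id, o.offset);
    }
};

struct SegmentBookkeeping {
    // column family UUID (simplified to uint64_t) -> highest written position
    std::map<uint64_t, ReplayPosition> dirty;

    void mark_write(uint64_t cf, ReplayPosition pos) {
        auto [it, inserted] = dirty.try_emplace(cf, pos);
        if (!inserted && it->second < pos) {
            it->second = pos;
        }
    }

    // Called when a memtable for `cf` has flushed everything up to `flushed`.
    // Returns true when the dirty set is empty, i.e. the segment is
    // safe to delete/recycle.
    bool mark_clean(uint64_t cf, ReplayPosition flushed) {
        auto it = dirty.find(cf);
        if (it != dirty.end() && !(flushed < it->second)) {
            dirty.erase(it);
        }
        return dirty.empty();
    }
};
```

A memtable flush that only covers part of a column family's writes leaves the segment dirty; only when every dirty column family has reported a flush at or past its highest position does the segment become recyclable.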

Scylla

All the IO can be done via simple file ops, preferably pwrite.

Addressing the above points:

  1. Segment size. No issue here. Simply keep count of the file size and the maximum. If a write to the segment would exceed it, toss the old segment and add a new one. No contention, assuming we deal with the points below:
  2. Write order. Obviously tied to consistency, but assuming we allow outstanding operations (non-batch mode, in Cassandra terms), as far as I can tell the only requirement is keeping the file open and section X unflushed until all writes are done. That seems like a superb job for a simple shared pointer + RAII behaviour. Once the write ops finish, either the whole file or a part of it (as defined by the highest written position) can be explicitly synced.
  3. Eventual consistency. If running in "batch" mode, this is trivial: just then()-continue the write (with an optional sync/force?). The "periodic" behaviour would simply do the same, conditionally, if too much time has passed since the last finished write op + sync. Obviously the segment needs to keep a timestamp counter, but again, since we have no contention, it is all trivial. The biggest question is probably how (if at all) to do a partial, asynchronous sync of file buffers to disk.
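
The shared pointer + RAII idea in point 2 can be sketched as follows, under the assumption that each reactor owns its segment exclusively. The names (`FlushSection`, `PendingWrite`) are illustrative, and the bool flag stands in for whatever async sync mechanism (e.g. a partial-range sync) is ultimately used.

```cpp
#include <cstddef>
#include <memory>

// Every in-flight write holds a reference to a flush "section"; when the
// last reference drops, the section's destructor triggers the sync of
// the range covered by the section.
struct FlushSection {
    size_t high_pos;   // highest position written in this section
    bool* synced;      // stand-in for the actual async sync of the range
    explicit FlushSection(bool* flag) : high_pos(0), synced(flag) {}
    ~FlushSection() { *synced = true; } // last writer done -> sync
};

struct PendingWrite {
    std::shared_ptr<FlushSection> section; // keeps the section open

    void complete(size_t end_pos) {
        if (end_pos > section->high_pos) {
            section->high_pos = end_pos; // record highest written position
        }
        section.reset(); // RAII: dropping the last ref finalizes the section
    }
};
```

The segment never needs to track individual writers: the shared pointer's reference count is the "all write operations completed" marker, and the highest recorded position bounds how much of the file must be synced.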