SSTable compaction and compaction strategies
About this document
This document starts by explaining what is compaction, and why Cassandra uses it. It then discusses the mechanism of the compaction (how to compact, where to run it, etc.), followed by a discussion of the policy of the compaction (i.e., when to compact, and what).
In Cassandra, SSTables are immutable: Mutations, adding new data or overriding or deleting old data, are inserted into a memory table and this memory table is periodically dumped to disk - each time to a fresh new SSTable. After a while, we have many separate SSTables. These SSTables might contain outdated data - e.g., different SSTables might contain both an old value and new value of the same cell, or an old value for a cell later deleted. That is fine, as Cassandra uses timestamps on each value or deletion to figure out which is the most recent value. However, it is a waste of disk space. It also slows down reads: different SSTables can hold different columns of the same row, so a query might need to read from multiple SSTables to compose its result.
For these reasons, compaction is needed. Compaction is the process of reading several SSTables and outputting one SSTable containing the merged, most recent, information.
The technique of keeping sorted files and merging them is well-known and called Log-Structured Merge-tree (LSM). The Wikipedia article describes this technique as having been invented in 1996, but the earliest popular application I'm aware was in the Lucene search engine in 1999 - the merging technique is what really set Lucene apart from its contemporary search engine libraries, because its indexing was both high-performance and up-to-the-minute (containing all the latest information).
Efficiency of compaction
Compaction is efficient in two senses:
- It does not involve reading the entire SSTables into memory.
- It requires mostly sequential disk reads.
Both properties are the result of the choice to keep in each SSTable the partitions sorted by their key (hence their name - sorted string tables), making merging them into a new sorted SSTable a simple and efficient operation (see merge). We have an example implementation of the merge algorithm as a C++ template in OSv nway_merger.hh.
The merge algorithm described above requires the ability to read several SSTable partitions with the same partition key, and then merge them correctly (taking the latest value or deletion of each column, correctly handling range tombstones, cell expiration times and tombstone gc times, etc.).
We already have this building block ready for our in-memory mutation_partition class (
mutation_partition::apply), so the easiest approach to start with is to:
- read each SSTable partition into a mutation_partition,
- merge several of those (who share the same partition key) from different SSTables,
- convert the result back to SSTable format. However, in the long run, merging will be more efficient if we could merge the SSTable format directly, instead of converting formats back and forth.
If our merge process goes through mutation_partition objects, and given that our in-memory data ("memtable") also uses these objects, one optimization we can use fairly easily (but with modest expected gain) is to avoid writing the last memtable to disk before compaction - and compact it directly from memory.
Deleting the old SSTables
While a merge of several SSTables is ongoing, the request path continues to read the old SSTables. Ideally, the old SSTables would be deleted as soon as the merge is done, but we must not delete an SSTable that still has in-progress reads.
One way we could handle this is using shared_ptr's: We'll have an in-memory object representing each SSTable, reference counted by using in through a shared_ptr. Additionally, we have a list of the live sstable shared_ptrs, keeping them alive. A request starts by copying all the live shared_ptrs (i.e., increasing these sstables' reference count) - this loop is atomic because it doesn't block (only one core accesses a set of SSTables). This keeps these sstables objects alive during this request's processing. The sstable object's destructor, eventually called when it has no more references, will finally delete the on-disk file. Of course, we musn't delete the list-of-live-sstable-shared_ptrs on shutdown, or we'll end up deleting all the on-disk sstables.
There is the possibility that the server shuts down or crashes while some already-merged SSTables are still in use (and therefore not deleted) - and we don't want these SSTables to be "resurrected" when the server comes up again, because we already have the merged SStable that replaces them. There are various ways to solve this problem (TODO: What does Cassandra use?):
- Each SSTable has a number ("generation"). Each SSTable can list the generations it superseded, and when a superseded SSTable is noticed during startup time, it is deleted.
- Delete the sstable as soon as it is not needed, using the fact that on Linux, an open file is still readable even if deleted. This solution is fragile and unfriendly to the file system.
- Keep a persistent list (on-disk file) of live SSTables. This is the option which Apache Lucene uses.
New sstables are written to a special file name (Cassandra uses the word "tmp" in the file name), and renamed to their regular name after completely written. This ensures that after a crash, we don't pick up a half-written sstable. Theoretically, we could even continue the merge around where it left off (although this feature can definitely wait).
Cassandra has parameter compaction_throughput_mb_per_sec (default 16MB/s) that limits the disk throughput used by compaction. We could have the same parameter.
Cassandra does compaction in a background thread, or several of those (e.g., leveled compaction, explained below, can happen in parallel). The OS divides the available cores between these background threads, and the request threads, so everything happens automatically albeit inefficiently.
In Scylla, on the other hand, we have two design options to choose from:
- Per-cpu compaction: In Scylla, each CPU is responsible for its own shard (collection of sstables). In this option, each CPU also does the compaction of its own sstables.
- Dedicated core: In this option, one CPU (or several CPUs) is dedicated to the compaction process.
We will go with the per-cpu option, as it fits the Scylla design much better. Since only one core can answer for a shard, if this core is busy doing some long merge, the answer will be delayed. So we must divide the compaction work into very small tasks, and make sure only a small amount of work is done on the merge before continuing the event loop. We need to figure out how small this "small amount" can be to 1. not increase latency by a significant amount, 2. make sure the merging progresses is as fast as we want it to, and 3. not waste too much performance on "context switching" between the tiny tasks.
Memory use during compaction
Compaction uses some temporary memory, and of course our goal is to limit this use.
Initially - while (as suggested above) we convert SSTable partitions to mutation_partition instead of operating on the original SSTable format - this means keeping a whole mutation from each compacted SSTable in memory, i.e., a total memory requirement of the merged partition length.
But for non-huge partitions we'll probably read a fixed-sized buffer instead of just one partition. For spinning disks, this buffer should be fairly long to reduce the number of seeks, but for SSDs it can be kept low. We'll need one such buffer for each compacted SSTable, so a single compaction which typically involves 4-14 files will need 14 of them, so it makes to keep the buffer small on an SSD.
If the memory use of these buffers is significant, it also means it's not desirable to drag on a single compaction for very long, taking up this buffer space for the whole duration. Because we want a limit on the total (across the whole system) disk bandwidth used for compaction, it may make sense to limit the number of compactions ongoing in parallel: If we start two compactions in parallel, and limit their combined disk throughput, each will take twice as long to complete - and therefore keep its memory buffers allocated for twice as long. Of course, there are other reasons why parallel compaction is desirable so it should not always be avoided (e.g., if we have more than one disk, or if in the per-cpu compaction approach, the cpus are busy).
Disk use during compaction
Compaction can also require a significant amount of temporary disk space (this is true for compaction strategies other than the leveled compaction - see details below). This is another reason to want to limit the number of parallel compactions.
Caching after compaction
Because Cassandra uses the OS's page cache for caching reads from the SSTable, it needs to take care to pre-read various partitions of the merged sstable based on the "hotness" of the pre-merge partitions.
Things are simpler (and more elegant) in Scylla which bypasses the OS page cache, and keeps (or plans to keep....) a row cache. Entries in the row cache are not changed on invalidated by the compaction operation - they already contain fully merged rows, and don't need to be evacuated or reread if the backing files change.
When to start using compacted tables
While a compaction is in progress, if read requests come in the most straight-forward thing to do is to continue serving them from the old sstables. The more elaborate thing we could do is to start using the partially written sstable for the key range it already contains (and use the old sstables for keys outside this range).
The second approach (use partially-written sstables) has a theoretical advantage when the compaction takes a long time, e.g., a very large sstable (see the size-tiered compaction strategy below) or rate-limited background compaction. If a request comes in that prior to the compaction would have required reading from multiple sstables, it is possible that the data it needs is already available in the partially-written sstable, and can be satisfied with just one read. This will not help really hot requests (where the row cache would contain that response anyway) but may help a situation where we have many different cold requests.
It isn't clear if the benefits of the second approach (use partially-written sstables) are worth the extra hassle. Cassandra started with the first (simple) option, then in https://issues.apache.org/jira/browse/CASSANDRA-6916 switched to the second option, and in https://issues.apache.org/jira/browse/CASSANDRA-8833 is considering switching back. For us, the appeal of the second approach is less than it was in Cassandra, because one of its benefits is helping warm the page cache, which is not relevant for us because we use a row cache, not the OS's page cache (see discussion above).
So far we've discussed why to compact several SSTables into one, and how to do it. The next question we need to ask is what policy (or strategy) to use - when to compact, and which sstables to compact.
Cassandra implements three "compaction strategies":
This is Cassandra's oldest (dating back to the original Google BigTable paper) and default compaction strategy. The idea here is to trigger compaction when we have enough (four by default) similarly sized SSTables. These are merged together, to form one larger SSTables. Later, when several large SSTables have accumulated, they will be merged to form one even-larger SSTable - and so on.
In this manner, we have several size tiers (small sstables, large sstables, even-larger sstables, etc.) and in each tier we have roughly the same number of files; When one tier is full, we merge all its tables to one table in the next tier.
In more details (see
Cassandra defines a "Bucket" as a collection of similar-sized sstables, and the SSTables are grouped to buckets. The "small" bucket contains SStables smaller than minSSTableSize (by default, DEFAULT_MIN_SSTABLE_SIZE = 50 MB). SStables are groups into buckets of similarly-sized files, such file sizes in the bucket are between 0.5 and 1.5 of the average size in the bucket (DEFAULT_BUCKET_LOW, DEFAULT_BUCKET_HIGH, respectively). Very "cold" (hardly read) SSTables are omitted from consideration - they are not worth bothering with since the main purpose of compaction is to improve read performance (DEFAULT_COLD_READS_TO_OMIT = 0.05). At least 4, and no more than 32, SSTables in the same bucket are considered for compaction (this 4 and 32 are per-table parameters: getMinimumCompactionThreshold(), getMaximumCompactionThreshold()).
Why was SizeTieredCompactionStrategy not "good enough" and LeveledCompactionStrategy was developed (see below)?
As noted in http://www.datastax.com/dev/blog/when-to-use-leveled-compaction, SizeTieredCompactionStrategy is great if rows are written once and never modified (or written a few times and then not modified again). In that case, each row will eventually end up being written as a whole to one compacted SSTable, and reads are efficient. But, if our use case involves continuously modifying existing rows, with size-tiered compaction, each row will always be split across several SSTables, making reads slow. This doesn't happen in leveled compaction (see below).
Also http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra points out two other disadvantages of size-tiered compaction, even if the workload is write-mostly:
- Obsolete data (overwritten or deleted columns) in a very large SSTable will stay behind for a long time, and waste a lot of space, for a long time until finally merged.
- Compaction requires a lot of temporary space: In the worst case, we need to merge all existing SSTables into one, so we need half the disk to be empty to write the output file and only later can delete the old SSTables.
With leveled compaction, instead of potentially huge SSTables we use small, fixed-size (by default 160 MB) SSTables divided into different "levels". The technique works as follows:
- New SSTables (dumped from memtables) are created in "Level 0".
- The other levels (all except level 0) are each a run of SSTables, of exponentially increasing size: "Level 1" is a run of 10 SSTables (of 160 MB each), "Level 2" is a run of 100 SSTables (160 MB each), etc.
- A run of SSTables is an LSM terminology for a set of SSTables with non-overlapping key ranges. In other words, each SSTable has a range (between its first key and its last key - remember that the keys are sorted), and these ranges are disjoint between the different SSTables at a single level.
- A run can be thought of as a split-up huge SSTable. The benefit of a run is that while a huge SSTable must be rewritten as a whole on modification, in a run we can modify only parts of it (individual sstables) while keeping the disjoint key requirement. This is what leveled compaction does.
- When we have enough (e.g., 4) sstables in Level 0, we compact them with all 10 sstables in Level 1. This compaction works like this: We read in parallel the 4 sstables in level 0 and 10 in level 1 and write new sstables for level 1 (replacing the 10 old ones we've compacted). We don't create one large sstable - rather, we write one sstable and when we reach the size limit (160 MB), we start a new sstable. Because the merge happens on sorted keys, the new sstables we generate are a run, i.e,, have non-overlapping key ranges.
- Now, after the compaction of level 0 into level 1, it is possible that we have more than the desired number 10 of sstables in level1. We pick one excess sstable from level 1, and compact it into level 2:
- We take one sstable from level 1 (this sstable will be deleted after the compaction)
- We look at this sstable's key range, and find all sstables in level 2 which overlap with it.
- Typically, there are about 12 of these (the level 1 sstable spans roughly 1/10th of the keys, while each level 2 sstable spans roughtly 1/100th of the keys, so 10 level-2 sstables will overlap the level-1 sstable's range, plus two more on the edges).
- As before, we compact the one sstable from level 1 and the 12 sstables from level 2 and replace all of those with new sstables in level 2.
- After this compaction of level 1 into level 2, now we can have excess sstables in level 2 so we merge them into level 3. Again, one sstable from level 2 will need to be compacted with around 10 sstables from level 3.
With the leveled compaction strategy, sstable reads are efficient: First, it's important to note that the great number of small sstables doesn't mean we need to look up a key in that many sstables, because we know the sstables in each level have disjoint ranges, so we only need to look in one sstable in each level. But in the more typical case, we just need to read one sstable. http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra claims that "90% of all reads will be satisfied from a single sstable (assuming nearly-uniform row size)." (TODO: understand why). The other factors making this compaction strategy efficient is that at most 10% of space will be wasted by obsolete rows (TODO: understand why), and only enough space for ~10x the small sstable size needs to be reserved for temporary use by compaction.
- CONTINUE HERE: the downside of this method is two times more io on writes (TODO: understand why) so it is not good for write-new-data-mostly workloads.
In more details (see
TODO: look at the code. look at configuration parameters (table subproperty min_threshold).
This was the last compaction strategy added to Cassandra (in version 2.1), designed for time series data. It was written by Björn Hegerfors, who describes the idea and its implementation very nicely in https://labs.spotify.com/2014/12/18/date-tiered-compaction/. See also Datastax's explanations of this feature: http://www.datastax.com/dev/blog/datetieredcompactionstrategy and http://www.datastax.com/dev/blog/dtcs-notes-from-the-field.
A time series in Cassandra uses the time of a data point as the clustering key. In a time-series use case, we see some common features:
- Clustering key and write time are correlated.
- Data is added in time order. Only few out-of-order writes, typically rearranged by just a few seconds.
- Data is only deleted through TTL or by deleting an entire partition.
- The rate at which data is written is nearly constant.
- A query on a time series is usually a range query on a given partition; The most common query is of the form "values from the last hour/day/week".
Cassandra remembers (in memory) the minimum and maximum clustering keys of each sstable (CASSANDRA-5514). So because of the above assumptions, it can easily know which sstable (before compaction) contains data relevant to a particular requested time range. However, both Size-Tiered and Leveled compaction destroy this neat ordering, because they usually merge old and new data in the same output sstable, making it no longer possible to rule out a whole "old" sstable from a search for "new" data. Moreover, while the other two compaction strategies always aim to move all the rows of a partition into the same sstable, for time series this is less ideal, because a query for new data will actually be more efficient if just the new data is in a separate sstable, and the query usually does not need to read huge partitions containing all the data ever written for a partition.
So the date-tiered compaction strategy sorts the sstables by time (an sstable's time is its minimum "timestamp", assuming these are in microseconds - see
timestamp_resolution sup-property). It then compacts adjacent (time-wise) sstables. The result are sstables whose sizes increase exponentially as they grow older. For example, at some point we can have the last minute of data in one sstable (by default,
base_time_seconds = 60), another minute before that in another sstable, then the 4 minutes before that in one sstable, then the 4 minutes before that, then an sstable of the 16 minutes before that, and so on. This structure can easily be maintained by compaction, very similar to what we did in size-tierd compaction: When we have 4 (the default value for
min_threshold) small (one-minute) consecutive sstables, we compact them into one 4-minute sstable. When we have 4 of those bigger sstables one after another (time-wise), we merge them into a 16-minute sstable, and so on.
Antique sstables older than
max_sstable_age_days (by default 365 days) are not compacted any more - doing those compactions will not be useful for most queries, will be very slow, and require huge amounts of temporary disk space.
Date-tiered compaction also helps in making the TTL (expiration times on data) more efficient. Cassandra keeps track of each sstable's most recent expiration time (see CASSANDRA-5228), and when we notice the oldest sstable's most recent expiration time has passed, we can drop the entire sstable, without bothering to check the expiration time of the individual cells it contains. This is especially useful when all data is inserted with the same TTL.
In more details (see
TODO: make sure I covered everything in http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html
Cassandra also has a feature called "Major compaction" which, according to http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCompact.html, is compaction with two unrelated properties:
- We do compaction for all the tables in the keyspace (not just compacting where it's needed).
- For each table, all its SSTables are compacted into a single SSTable.
Such "major compaction" never happens automatically - it can only be started manually via
nodetool. It is not normally something which should be done. In many cases (e.g., Date-Tired compaction strategy) it is downright counter-productive. Looking on the Internet, both users and Datastax recommend against using it. So we probably shouldn't implement it - at least not initially.