
v3.2.0

@eeeebbbbrrrr released this on 26 Oct 23:05

This is ZomboDB v3.2.0. The primary focus of this release is improving general search and aggregation function performance, especially when the system is under high write load.

This release does necessitate that your existing USING zombodb indexes be dropped and re-created, as fundamental changes have been made to the underlying Elasticsearch index mappings and types.

Upgrading from Previous Versions

The process for upgrading from a previous release should follow these steps:

Before Installing v3.2.0

First off, make sure you have exclusive access to the databases that use ZomboDB during the entire upgrade process. This is important so we can ensure no other Postgres sessions are trying to modify your tables.

  • DROP INDEX every existing ZomboDB index in all databases. Make sure you have the CREATE INDEX statements saved somewhere, or capture them with pg_get_indexdef('index_name'::regclass) before dropping them
  • VACUUM every table that had a ZomboDB index. The purpose of this is to remove all dead rows from the table so they're not re-indexed by ZomboDB (see the sketch after this list)
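
A minimal sketch of these two steps, assuming a hypothetical index named idx_products_zdb on a table named products:

    -- save the definition of each ZomboDB index before dropping it
    SELECT pg_get_indexdef('idx_products_zdb'::regclass);

    -- drop the index
    DROP INDEX idx_products_zdb;

    -- remove dead rows so they won't be re-indexed after the upgrade
    VACUUM products;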

Installing v3.2.0

Make sure you've closed any psql (or other admin tool) sessions left open from the steps above.

  • Install the Elasticsearch plugin on every node of your cluster and restart the entire cluster
  • Install the Postgres extension on your database server
  • Re-create all the USING zombodb indexes you dropped above (an example follows this list)
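
If you no longer have the original CREATE INDEX statements, the v3-era syntax looks roughly like this (the index name, table, and Elasticsearch URL below are placeholders):

    CREATE INDEX idx_products_zdb
              ON products
           USING zombodb (zdb('products', products.ctid), zdb(products))
            WITH (url='http://localhost:9200/');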

What's New with v3.2.0

  • Pre-built Postgres extension binaries for Ubuntu Xenial
  • Full accounting of deleted rows in aggregation functions such as zdb_tally() and zdb_estimate_count(). Previous versions of ZomboDB would include deleted rows in aggregation results until (auto)vacuum removed them from the table
  • Significantly improved concurrent search performance when the table/index has lots of dead (yet-to-be-vacuumed) rows
  • Significantly improved aggregation function performance -- functions such as zdb_tally() and zdb_estimate_count()
  • Aggregation functions can now utilize Elasticsearch's query cache
  • A new data model for tracking Postgres' MVCC information within Elasticsearch
  • Updated zdb_index_stats and zdb_index_stats_fast to show document counts for the new xmax and aborted types
  • The need to very aggressively vacuum tables is greatly reduced

How ZomboDB Tracks Postgres MVCC in Elasticsearch

Without going into great detail about how Postgres manages MVCC, this section will attempt to explain how ZomboDB maps Postgres' MVCC into Elasticsearch so that concurrent transactions always see the right rows, even when ZomboDB needs to resolve MVCC visibility wholly within Elasticsearch.

First off, ZomboDB defines three types within each Elasticsearch index -- data, xmax, and aborted.

  • data contains the actual row data from the table, along with some tracking columns, most importantly the Postgres system columns xmin and cmin (named _xmin and _cmin in the ES index) and a compact, byte-encoded representation of the Postgres heap tuple, named _zdb_encoded_tuple. ZomboDB also tracks, as a separate field, the BlockNumber from the Postgres table where the row resides (in a field named _zdb_blockno). There is one doc in this type for every row in the Postgres table.

  • xmax contains the Postgres system columns xmax and cmax (named _xmax and _cmax in the ES index). Docs for this type are only created when Postgres changes the xmax of a row, which happens to the original row during an UPDATE and to the row itself during a DELETE. Like the "data" type, "xmax" also contains a compact, byte-encoded representation named _zdb_encoded_tuple. Changes to "xmax" values are tracked through UPDATE and DELETE triggers automatically installed on every table with a ZomboDB index.

For both types above, the Elasticsearch _id field is set to the Postgres heap tuple item pointer, which (for ZomboDB) is in the form of BlockNumber-OffsetNumber. This is the same value (slightly reformatted) as the Postgres system column ctid.
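
For example, a row whose ctid is (0,3) (heap block 0, offset number 3) is indexed with an Elasticsearch _id of "0-3". You can see the raw values with a query like this (the table name is a placeholder):

    SELECT ctid, * FROM products LIMIT 5;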

As such, when an "xmax" doc is created, it is routed to the same shard on which its corresponding "data" doc already lives. This is important for resolving visibility through a custom Lucene Query -- more on that later.

  • aborted contains a transient set of (64-bit epoch-encoded) Postgres TransactionIds, each of which is explicitly copied to every shard (10 shards means 10 copies of the doc). A doc is added to this type when an index is first modified (INSERT, UPDATE, DELETE) in a transaction, and if the transaction commits, the doc is then removed. The _id field is simply the transaction id, and there's another field named _zdb_xid that contains the same value.
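
Putting the three types together, the docs might look roughly like the following sketch (the field values are hypothetical, and the row's user columns are elided):

    data:    { "_id": "42-7", "_xmin": 100, "_cmin": 0, "_zdb_blockno": 42, "_zdb_encoded_tuple": "...", ... }
    xmax:    { "_id": "42-7", "_xmax": 105, "_cmax": 1, "_zdb_encoded_tuple": "..." }
    aborted: { "_id": "105", "_zdb_xid": 105 }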

When executing a query that requires ZomboDB to resolve MVCC visibility entirely within Elasticsearch (typically aggregation functions or queries with advanced constructs such as index linking or #expand()), ZomboDB's custom Lucene Query executes on each shard to decide which docs in the index are not visible to the current transaction.

First of all, the custom Lucene Query gets a list of all the transaction ids (_zdb_xid) from the "aborted" type. In general, this will be a small list.

Then, the custom Query collects all the docs from "xmax". These tell our visibility resolution logic which documents have been modified in some way, and let it determine whether the _xmax value has made the document invisible to the current transaction.

Next, the Query collects all the docs from "data" that reside on the same Postgres heap block as any of the docs found above from the "xmax" type. Depending on the number of "xmax" docs relative to the number of "data" docs, ZomboDB may decide it's more efficient to simply collect all the "data" docs.

It then merges this information together to represent a full view of the "heap tuple" as it would be in the Postgres table heap at that point in time.

From there, each collected "data" doc, with its corresponding "xmax" values (if any) and the set of aborted transaction ids, is evaluated using this logic:

(
    (xmin == myXid && cmin < myCommand && (xmax_is_null || (xmax == myXid && cmax >= myCommand)))
        ||
    (xmin_is_committed && (xmax_is_null || (xmax == myXid && cmax >= myCommand) || (xmax != myXid && !xmax_is_committed)))
)

Where myXid and myCommand are the transaction id and command counter of the Postgres Snapshot executing the query, and myXmin and myXmax (the snapshot's bounds) are used, together with the set of aborted transaction ids, to resolve the xmin_is_committed and xmax_is_committed checks.
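
To make the predicate concrete, here is a small, self-contained SQL sketch that evaluates it for a few hypothetical tuples against a snapshot where myXid = 100 and myCommand = 3 (the committed-ness flags are supplied as literals rather than resolved from a real snapshot):

    SELECT xmin_, cmin_, xmax_, cmax_,
           (   (xmin_ = 100 AND cmin_ < 3
                AND (xmax_ IS NULL OR (xmax_ = 100 AND cmax_ >= 3)))
            OR (xmin_committed
                AND (xmax_ IS NULL
                     OR (xmax_ = 100 AND cmax_ >= 3)
                     OR (xmax_ <> 100 AND NOT xmax_committed)))
           ) AS visible
      FROM (VALUES
          (100, 1, NULL::int, NULL::int, false, false), -- our own earlier insert: visible
          ( 90, 0, 100,       5,         true,  false), -- committed row we deleted in a later command: still visible
          ( 90, 0,  95,       0,         true,  true )  -- deleted by another, committed transaction: not visible
      ) AS t(xmin_, cmin_, xmax_, cmax_, xmin_committed, xmax_committed);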

In all of the above, the _zdb_encoded_tuple field is used to efficiently determine the xmin/xmax/cmin/cmax values for each doc (this has proved to be extremely efficient in situations where there are millions of "xmax" documents to evaluate).

If the combined tuple passes the test, it's visible to the transaction. If it doesn't, it isn't visible and is marked as such by the custom Lucene Query.

When a vacuum (or autovacuum) comes along, ZomboDB asks it about each _id (ctid) that it indexes, and if VACUUM says it's dead, we delete it from both the "data" and "xmax" types. Additionally, when we can prove that an _xmax is known-to-be-considered-aborted-by-all-current-and-future-transactions, we can delete every doc in "xmax" with that transaction id along with the entries from the "aborted" type. This keeps the total number of docs we need to track and evaluate for MVCC visibility in Elasticsearch to the bare minimum.

TL;DR

ZomboDB assumes that all transactions are aborted (until they actually commit), tracks xmax changes in real time via table triggers, and resolves visibility across all shards in parallel. It then cleans itself up when vacuum runs.

Quick Discussion around Vacuum

Previous versions of ZomboDB wanted autovacuum to be pretty aggressive. The reason for this was that ZomboDB wasn't necessarily very fast at resolving visibility when the table/index contained lots of dead rows. With v3.2.0, the landscape is quite a bit different.

Depending on the complexity of your query, and other factors (of course), v3.2.0 answers queries in milliseconds to tens of milliseconds even when there are tens of thousands of dead rows in the index. As such, you might consider dialing up the autovacuum_naptime and autovacuum_vacuum_threshold settings in postgresql.conf. ZomboDB can now stand up to more dead rows than in the past, and it's also capable of using Elasticsearch's query cache in more situations. When queries are answered from the cache, no searching is actually done, so it doesn't really matter how many dead rows are in the table.
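
For example, a less aggressive starting point might look like this in postgresql.conf (the values here are illustrative only; tune them for your workload):

    autovacuum_naptime = 5min             # check for vacuum work less often (default 1min)
    autovacuum_vacuum_threshold = 5000    # tolerate more dead rows before vacuuming (default 50)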

Additionally, lessening the vacuum load on the Postgres server, which also reduces the number of Elasticsearch index refreshes ZomboDB needs to perform, can generally help all around with disk I/O. Note that it's still super important that your tables be vacuumed often for all the reasons that vacuuming is good -- v3.2.0 just gives you a little more wiggle room around it.

Monitoring What's Happening

The views zdb_index_stats_fast and zdb_index_stats are very useful for watching the state of the backing Elasticsearch index. The _fast view gives you an estimate of the number of rows in the Postgres table, so it's generally preferred, as doing a count(*) (what the non-_fast view does) on a multi-million row table can be really slow.

Paying attention to the xmax_count and aborted_count columns from those views will give you an idea of how much "garbage" is still in the backing Elasticsearch index. Under concurrent load, with autovacuum configured, you should see these values rise during writes and drop when autovacuum runs.

Note that the aborted_count column can be divided by the number of shards in your index to get the actual number of distinct aborted transaction ids ZomboDB is tracking. As mentioned above, aborted transaction ids are copied to every shard.
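
For example (the index_name column is an assumption; check the views for the actual column names):

    SELECT index_name, xmax_count, aborted_count
      FROM zdb_index_stats_fast;

    -- with 10 shards, an aborted_count of 50 means 50 / 10 = 5 distinct
    -- aborted transaction ids are currently being tracked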

When Postgres is idle (in terms of writes), you should see both of these columns reach zero. Because the "xmax" and "aborted" docs are removed during the "cleanup" phase of vacuum, it may be necessary to execute an explicit VACUUM on the table before you'll see the columns actually hit zero.
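
For instance (the table name is a placeholder):

    VACUUM products;

    SELECT xmax_count, aborted_count
      FROM zdb_index_stats_fast;  -- both should now be zero for that index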

Reindexing Sucks, WTH!?

Yeah, sorry. The new structure v3.2.0 uses to track and resolve MVCC visibility, which is the central feature of ZomboDB, is really solid and hopefully future-proof. The hope and intent is that it'll be quite a long time before ZomboDB asks its users to reindex their data again.

Despite the need to reindex, I highly recommend that everyone upgrade, especially if you're on v3.1.13-3.1.15. v3.1.13 introduced some corruption bugs, and while v3.1.15 fixed them, it made the entire thing much slower.

v3.2.0 gets all this stuff right.