# v3.2.0
This is ZomboDB v3.2.0. The primary focus of this release is improving general search and aggregation function performance, especially when the system is under high write load.
This release requires that your existing `USING zombodb` indexes be dropped and re-created, as fundamental changes to the underlying Elasticsearch index mappings and types have been made.
## Upgrading from Previous Versions
The process for upgrading from a previous release should follow these steps:
### Before Installing v3.2.0
First off, make sure you have exclusive access to the databases that use ZomboDB during the entire upgrade process. This is important so we can ensure no other Postgres sessions are trying to modify your tables.
1. `DROP INDEX` for every existing ZomboDB index in all databases. Make sure you have the `CREATE INDEX` statements saved somewhere, or use `pg_get_indexdef('index_name'::regclass)` before dropping them.
2. `VACUUM` every table that had a ZomboDB index. The purpose of this is to remove all dead rows from the table so they're not indexed by ZomboDB.
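In `psql`, the preparation steps above might look like this (the index and table names are hypothetical):

```sql
-- save the definition of each ZomboDB index before dropping it
SELECT pg_get_indexdef('idx_products_zdb'::regclass);

-- drop the index
DROP INDEX idx_products_zdb;

-- remove dead rows so they won't be indexed when the index is re-created
VACUUM products;
```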
### Installing v3.2.0
Make sure you've closed any open `psql` (or other admin tool) sessions that might still be open from above.

- Install the Elasticsearch plugin on every node of your cluster and restart the entire cluster
- Install the Postgres extension on your database server
- Re-create all the `USING zombodb` indexes you dropped above
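Re-creating an index means re-running the statement you saved with `pg_get_indexdef()`. A sketch, with an entirely hypothetical index name and options (your actual `zdb()` expressions and `WITH` options come from the saved definition):

```sql
-- re-run the statement previously captured with pg_get_indexdef();
-- the names and the url option below are illustrative only
CREATE INDEX idx_products_zdb
          ON products
       USING zombodb (zdb('products', products.ctid), zdb(products))
        WITH (url='http://localhost:9200/');
```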
## What's New with v3.2.0
- Pre-built Postgres extension binaries for Ubuntu Xenial
- Full accounting of deleted rows in aggregation functions such as `zdb_tally()` and `zdb_estimate_count()`. Previous versions of ZomboDB would include deleted rows in aggregation results until (auto)vacuum removed them from the table
- Significantly improved concurrent search performance when the table/index has lots of dead (yet-to-be-vacuumed) rows
- Significantly improved aggregation function performance for functions such as `zdb_tally()` and `zdb_estimate_count()`
- Aggregation functions can now utilize Elasticsearch's query cache
- A new data model for tracking Postgres' MVCC information within Elasticsearch
- Updated `zdb_index_stats` and `zdb_index_stats_fast` to show document counts for the new `xmax` and `aborted` types
- The need to very aggressively vacuum tables is greatly reduced
## How ZomboDB Tracks Postgres MVCC in Elasticsearch
Without going into great detail about how Postgres manages MVCC, this section will attempt to explain how ZomboDB maps Postgres' MVCC into Elasticsearch so that concurrent transactions always see the right rows, even when ZomboDB needs to resolve MVCC visibility wholly within Elasticsearch.
First off, ZomboDB defines three types within each Elasticsearch index -- `data`, `xmax`, and `aborted`.
- `data` contains the actual row data from the table, along with some tracking columns, most importantly the Postgres system columns `xmin` and `cmin` (named `_xmin` and `_cmin` in the ES index) and a compact, byte-encoded representation of the Postgres heap tuple, named `_zdb_encoded_tuple`. ZomboDB also tracks, as a separate field, the BlockNumber from the Postgres table where the row resides (in a field named `_zdb_blockno`). There is one doc in this type for every row in the Postgres table.
- `xmax` contains the Postgres system columns `xmax` and `cmax` (named `_xmax` and `_cmax` in the ES index). Docs for this type are only created when Postgres changes the `xmax` of a row. This happens to the original row during an UPDATE, and to the row being deleted during a DELETE. Like the "data" type, "xmax" also contains a compact, byte-encoded representation named `_zdb_encoded_tuple`. The changing of "xmax" values is tracked through update and delete triggers automatically installed on every table with a ZomboDB index.
For both types above, the Elasticsearch `_id` field is set to the Postgres heap tuple item pointer, which (for ZomboDB) is in the form of `BlockNumber-OffsetNumber`. This is the same value (slightly reformatted) as the Postgres system column `ctid`.
As such, when an "xmax" doc is created, it is routed to the same shard where its corresponding "data" doc already lives. This is important for resolving visibility through a custom Lucene Query -- more on that later.
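You can see the value ZomboDB reformats into `_id` directly in Postgres (hypothetical table name; the exact `ctid` depends on where the row lives):

```sql
-- ctid is (BlockNumber,OffsetNumber); a row with ctid (0,1) would be
-- stored in Elasticsearch with _id "0-1"
SELECT ctid FROM products LIMIT 1;
```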
- `aborted` contains a transient set of (64-bit epoch-encoded) Postgres TransactionIds, where each is explicitly copied to each shard (10 shards means 10 copies of the doc). A doc is added to this type when an index is first modified (INSERT, UPDATE, DELETE) in a transaction, and if the transaction commits, the doc is then removed. The `_id` field is simply the transaction id, and there's another field named `_zdb_xid` that contains the same value.
When executing a query that requires ZomboDB to resolve MVCC visibility entirely within Elasticsearch (typically aggregation functions or queries with advanced constructs such as index linking or `#expand()`), ZomboDB's custom Lucene Query executes on each shard to decide which docs in the index are not visible to the current transaction.
First of all, the custom Lucene Query gets a list of all the transaction ids (`_zdb_xid`) from the "aborted" type. In general, this will be a small list.
Then, the custom Query collects all the docs from "xmax". These tell our visibility-resolution logic which documents have been modified in some way, and let it determine whether the `_xmax` value has made the document invisible to the current transaction.
Next, the Query collects all the docs from "data" that reside on the same Postgres heap block as any of the docs found above from the "xmax" type. Depending on the number of "xmax" docs relative to the number of "data" docs, ZomboDB may decide it's simply more efficient to collect all the "data" docs.
It then merges this information together to represent a full view of the "heap tuple" as it would be in the Postgres table heap at that point in time.
From there, each of the collected "data" docs, with its corresponding "xmax" values (if any), and the set of aborted transaction ids, is evaluated using this logic:

```
(
    (xmin == myXid && cmin < myCommand && (xmax_is_null || (xmax == myXid && cmax >= myCommand)))
  ||
    (xmin_is_committed && (xmax_is_null || (xmax == myXid && cmax >= myCommand) || (xmax != myXid && !xmax_is_committed)))
)
```
Where `myXid`, `myXmin`, `myXmax`, and `myCommand` represent the values of the Postgres Snapshot executing the query.
In all of the above, the `_zdb_encoded_tuple` field is used to efficiently determine the xmin/xmax/cmin/cmax values for each doc (this has proven extremely efficient in situations where there are millions of "xmax" documents to evaluate).
If the combined tuple passes the test, it's visible to the transaction. If it doesn't, it isn't visible and is marked as such by the custom Lucene Query.
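The predicate above can be re-expressed as a minimal Python sketch. The function name and the `committed_xids` set, which stands in for ZomboDB's real commit-status and snapshot lookups, are assumptions for illustration:

```python
def tuple_is_visible(xmin, cmin, xmax, cmax, my_xid, my_command, committed_xids):
    """Evaluate the MVCC visibility predicate for one merged "data"/"xmax" doc.

    xmin/cmin come from the "data" doc; xmax/cmax come from its "xmax" doc
    (xmax is None when no "xmax" doc exists); my_xid/my_command are the
    current transaction id and command counter from the Snapshot.
    """
    xmin_is_committed = xmin in committed_xids
    xmax_is_null = xmax is None
    xmax_is_committed = (not xmax_is_null) and xmax in committed_xids

    return (
        # inserted by us in an earlier command, and not deleted
        # (or only deleted by us in a current-or-later command)
        (xmin == my_xid and cmin < my_command
         and (xmax_is_null or (xmax == my_xid and cmax >= my_command)))
        or
        # inserted by a committed transaction, and any deletion either
        # hasn't happened, is our own later delete, or never committed
        (xmin_is_committed
         and (xmax_is_null
              or (xmax == my_xid and cmax >= my_command)
              or (xmax != my_xid and not xmax_is_committed)))
    )
```

For example, a row inserted by committed xid 100 with no "xmax" doc is visible to xid 300, while the same row with an `xmax` set by committed xid 200 is not.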
When a vacuum (or autovacuum) comes along, ZomboDB asks it about each `_id` (ctid) that it indexes, and if VACUUM says it's dead, we delete it from both the "data" and "xmax" types. Additionally, when we can prove that an `_xmax` is known-to-be-considered-aborted-by-all-current-and-future-transactions, we can delete every doc in "xmax" with that transaction id, along with the entries from the "aborted" type. This keeps the total number of docs we need to track and evaluate for MVCC visibility in Elasticsearch to the bare minimum.
### TL;DR
ZomboDB assumes that all transactions are aborted (until they actually commit), tracks xmax changes in real time via table triggers, and resolves visibility across all shards in parallel. It then cleans itself up when vacuum runs.
## Quick Discussion around Vacuum
Previous versions of ZomboDB wanted autovacuum to be pretty aggressive. The reason for this was that ZomboDB wasn't necessarily very fast at resolving visibility when the table/index contained lots of dead rows. With v3.2.0, the landscape is quite a bit different.
Depending on the complexity of your query, and other factors (of course), v3.2.0 performs in millisecond/tens-of-millisecond times even when there are tens of thousands of dead rows in the index. As such, you might consider dialing up the `autovacuum_naptime` and `autovacuum_vacuum_threshold` settings in `postgresql.conf`. ZomboDB can now stand up to more dead rows than in the past, and it's also capable of using Elasticsearch's query cache in more situations. When queries are answered from the cache, no searching is actually done, so it doesn't really matter how many dead rows are in the table.
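For example, in `postgresql.conf` (the values here are illustrative starting points, not recommendations; tune them for your own write load):

```
autovacuum_naptime = '5min'          # default is 1min; wake autovacuum less often
autovacuum_vacuum_threshold = 5000   # default is 50; tolerate more dead rows first
```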
Additionally, lessening the vacuum load on the Postgres server, which also reduces the number of Elasticsearch index refreshes ZomboDB needs to perform, can generally help all around with disk I/O. Note that it's still super important that your tables be vacuumed often for all the reasons that vacuuming is good -- v3.2.0 just gives you a little more wiggle room around it.
## Monitoring What's Happening
The views `zdb_index_stats_fast` and `zdb_index_stats` are very useful for watching the state of the backing Elasticsearch index. The `_fast` view gives you an estimate of the number of rows in the Postgres table, so it's generally preferred, as doing a `count(*)` (what the non-`_fast` view does) on a multi-million-row table can be really slow.
Paying attention to the `xmax_count` and `aborted_count` columns from those views will give you an idea of how much "garbage" is still in the backing Elasticsearch index. Under concurrent load, with autovacuum configured, you should see these values rise during writes and drop when autovacuum runs.
Note that the `aborted_count` column can be divided by the number of shards in your index to get the actual count of aborted transaction ids ZomboDB is tracking. As mentioned above, aborted transaction ids are copied to every shard.
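A quick check might look like this (only `xmax_count` and `aborted_count` are documented above; the shard count of 5 is a hypothetical value for your index):

```sql
-- "garbage" still tracked in the backing Elasticsearch index;
-- divide aborted_count by the shard count for the true xid count
SELECT xmax_count,
       aborted_count,
       aborted_count / 5 AS aborted_xids
  FROM zdb_index_stats_fast;
```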
When Postgres is idle (in terms of writes), you should see both of these columns reach zero. Because the "xmax" and "aborted" docs are removed during the "cleanup" phase of vacuum, it may be necessary to execute an explicit `VACUUM` of the table before you'll see the columns actually hit zero.
## Reindexing Sucks, WTH!?
Yeah, sorry. The new structure v3.2.0 uses to track and resolve MVCC visibility, which is really the central feature of ZomboDB, is solid and hopefully future-proof. The hope and intent is that it'll be quite a long time before ZomboDB asks its users to reindex their data again.
Despite the need to reindex, I highly recommend that everyone upgrade, especially if you're on v3.1.13-3.1.15. v3.1.13 introduced some corruption bugs, and while v3.1.15 fixed them, it made the entire thing much slower.
v3.2.0 gets all this stuff right.