
Backup and restore overview in Babelfish

Introduction

Backing up an RDBMS is an essential task to ensure that data recovery can be performed in case of hardware or other failures. Backups need to be done regularly and restore tests need to be run on these backups to check that all required data is there and that it can actually be restored when needed.

With Babelfish, some of its implementation details (multiple T-SQL DBs inside a single PostgreSQL DB, Babelfish metadata catalogs) make backups less straightforward than with plain Postgres.

In this article we will talk mostly about "full DB" backups, in which DB data is backed up along with the DB schema so the whole DB can be restored on another machine. Other kinds of backups are: partial (only some DB objects), incremental (only changed data, not yet supported in stable Postgres, coming in Pg17) and data only (data export/import).

File system level backups

Each PostgreSQL instance stores all DB files and settings in a dedicated DB cluster file system directory. This directory is self-contained and, when copied to another machine, can be used as-is with the same version of Postgres.

WiltonDB on Windows stores default DB cluster files inside the <install_dir>/data directory (see Changing DB data directory to use a different directory). On Linux the DB cluster is stored under /var/lib/pgsql/data or /var/lib/postgresql/15/main depending on the distro.

Copying the whole DB cluster is a supported backup approach, but it comes with a big caveat: the Postgres instance must be completely stopped before copying files. Accidentally copying DB cluster files of a running instance will produce a corrupted backup even if the instance is idle.

Besides that, it is not possible to back up only a subset of DB files - all cluster files need to be copied to have a consistent backup.
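
As a rough illustration, a file system level backup on a Linux machine may look like the following. The service name and directory paths below are distro-specific and are only assumptions:

# stop the Postgres instance completely before copying any files
sudo systemctl stop postgresql
# copy the whole DB cluster directory preserving ownership and permissions
sudo cp -a /var/lib/pgsql/data /backups/pgdata_$(date +%Y%m%d)
# start the instance again once the copy is finished
sudo systemctl start postgresql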

SQL-level backups over TDS connection

Babelfish allows accessing the same DB data over both Postgres (default port 5432) and TDS (default port 1433) connections. All data can be accessed over a TDS connection using ordinary SQL queries. And definitions of tables and other DB objects can in most cases be obtained from metadata introspection views (information_schema views and others).

For example, if we create some DB object:

create view myview1 as select 42

We can obtain the original definition of this object from information_schema:

select view_definition from information_schema.views where table_name = 'myview1'

So, in theory, using only a TDS connection, it is possible to enumerate all DB tables, get column descriptions for these tables (to generate DDL definitions) and fetch all data from these tables to generate insert statements. Such tools exist for MSSQL and some of them may be compatible with WiltonDB.
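
For illustration, such a tool could start from introspection queries like the following over a TDS connection (the table name mytable1 is just an example):

select table_schema, table_name from information_schema.tables where table_type = 'BASE TABLE'

select column_name, data_type, is_nullable from information_schema.columns where table_name = 'mytable1' order by ordinal_position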

For example, the DBeaver tool is known to work correctly with WiltonDB (using the default MSSQL JDBC driver), see the DBeaver wiki page for details.

Using this approach for "full DB" backups may be problematic for the following reasons:

  1. Besides tables and views there are many DB object kinds in T-SQL (namespaces, triggers, domains, extended properties etc.). We are not aware of any tool that can get definitions of all DB objects through SQL introspection with WiltonDB.

  2. To get a consistent backup it is necessary to have the selected logical T-SQL DB in a completely idle state, when the introspection/fetch tool connection is the only connection to this DB. This may not be easy to achieve on live DB instances.

While the points above make SQL-level backups infeasible as "full DB" backups, this approach can still be valuable for backups of particular tables and for moving data between MSSQL and WiltonDB, see also Data import with bcp.
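
For example, data of a single table may be moved from MSSQL to WiltonDB with bcp roughly like this; server addresses, the sa/wilton logins and the passwords below are placeholders:

bcp mydb1.dbo.mytable1 out mytable1.dat -S mssql_host -U sa -P secret1 -c
bcp mydb1.dbo.mytable1 in mytable1.dat -S wiltondb_host,1433 -U wilton -P secret2 -c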

Dump backups over PostgreSQL connection

Most RDBMS do not require client tools to perform manual introspection to do backups. Usually first-party standard tools exist for this, specific to a particular RDBMS. In Postgres such tools are pg_dump and pg_dumpall.

In general, pg_dump does more or less the same as a custom T-SQL tool from the previous section - it performs introspection of DB objects and writes definitions of these objects and all table data to a backup file on the file system that can be restored using the psql or pg_restore utilities. But unlike third-party SQL-level tools, pg_dump uses an internal API for introspection and, being part of core Postgres, is guaranteed to correctly enumerate and dump all kinds of DB objects.

pg_dump also allows creating consistent backups on live DB instances without any additional setup - it uses MVCC features to take a consistent snapshot of the data at the beginning of a backup run. It supports many options for choosing the backup format, using compression or selecting a subset of DB objects.
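
For example, a compressed custom-format backup of a plain Postgres DB can be taken on a live instance roughly like this (host, user and DB names are placeholders):

pg_dump -h localhost -p 5432 -U postgres -F custom -Z 6 -f mydb_backup.dump mydb
# later, restore into a pre-created empty DB
pg_restore -h localhost -p 5432 -U postgres -d mydb_restored mydb_backup.dump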

pg_dump is the go-to tool for plain Postgres DBs (though there are also other popular backup tools like pgBackRest), but things get more complicated with Babelfish-flavoured Postgres. Two related features of Babelfish require additional care when using pg_dump - multiple "logical" DBs and Babelfish system catalogs.

Multiple logical DBs

Babelfish stores all its data in a single "physical" Postgres DB, named wilton in WiltonDB and babelfish_db in AWS Aurora. While Babelfish supports both single-db and multi-db modes, in this article we assume that multi-db mode is used. In this case, when a new logical T-SQL DB is created using create database mydb1, under the hood a new DB schema mydb1_dbo is created for it inside the physical DB.
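
This mapping can be observed directly: after a logical DB is created over a TDS connection, its backing schema is visible over a Postgres connection to the physical DB (wilton here), for example:

-- over TDS connection (port 1433)
create database mydb1

-- over Postgres connection (port 5432) to the wilton DB
select schema_name from information_schema.schemata where schema_name like 'mydb1%'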

pg_dump treats the physical DB as the target for a "full DB" backup. But with multiple T-SQL DBs we want to back up only a single selected logical DB, not all logical DBs at once. Default unmodified pg_dump cannot do this.

Babelfish-flavoured pg_dump

Babelfish includes a modified version of pg_dump that supports the --bbf-database-name option, which allows specifying the name of a logical DB.
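
With this option a single logical DB can be dumped from the physical wilton DB, for example (connection parameters are placeholders, the remaining flags are standard pg_dump options):

pg_dump --bbf-database-name=mydb1 -h localhost -p 5432 -U wilton -F custom -f mydb1_backup.dump wilton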

This modified pg_dump also contains special logic for backing up Babelfish system catalogs that store metadata for T-SQL objects.

Restoring a DB under a different name

One common scenario we've observed in practice is restoring a DB backup under a different DB name. For example, taking a mydb1 backup from a live server and restoring it on a dev machine under the name mydb1_mycustomer1_date1 alongside other versions of the same DB that already exist on the dev machine. This scenario works well with MSSQL, where DBs can be easily renamed.

Unfortunately in Babelfish (as of version 3.3) logical DBs cannot be renamed. And in Postgres pg_dump writes fully qualified names into the backup file, including the DB schema name that is used to store the logical DB name.

To work around this limitation WiltonDB provides the pgdump_toc_rewrite library (and a command line tool) that can be used to rewrite the logical DB name in backup files. When a backup file is restored into an empty Babelfish instance, the server does not have any information about the DB name except the name encoded in the backup files. pgdump_toc_rewrite rewrites the DB name both in the TOC (table of contents) file and in the backed up data files of Babelfish catalogs.

Besides the catalog table entries, which can be adjusted with pgdump_toc_rewrite, Babelfish also requires a number of DB server roles for every logical DB. These roles are global for the whole DB cluster, thus they are not included in the pg_dump output, which is produced for a particular Postgres DB (in our case, always the Babelfish one).

Global roles for a logical DB can be backed up in SQL form using the pg_dumpall utility and then adjusted as needed before restore. They can also be created manually, see Creating server roles for new DB name for details.
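
For example, role definitions for the whole cluster can be exported to a SQL file like this and then trimmed down to the roles of the logical DB in question (connection parameters are placeholders):

pg_dumpall --roles-only -h localhost -p 5432 -U wilton -f roles_backup.sql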

GUI Backup tool

WiltonDB provides a GUI Backup tool that uses the pg_dump and pg_restore utilities and the pgdump_toc_rewrite library under the hood.

See WiltonDB Backup GUI tool for details.

See also: Scripting backup and restore operations.
