Disaster recovery in Vespa #17898

nehajatav · 2021-05-19T11:02:37Z

Hi Vespa team,
What are the options for disaster recovery in Vespa? I have read about 'visit', but my understanding is that it takes the backup in the form of JSON. The problem is that if I want to quickly bring up a Vespa cluster from this JSON, we incur the time overhead of re-indexing the documents. Do you recommend taking periodic disk index snapshots for disaster recovery? Is there a way to pause writes, force a flush to disk on all content nodes, and take a snapshot of the disk using something like rsync? We currently use rsync-based backup for a Lucene-based search engine, and the differential rsync for the delta duration is pretty quick. That is based on the assumption that index contents on disk change incrementally rather than massively. How does Vespa Cloud manage disaster recovery if an entire datacenter gets wiped out?

bratseth · 2021-05-19T11:46:57Z

For high availability we recommend running a live Vespa instance in at least two regions/data centers.
There aren't many scenarios where it helps to be able to quickly restart with the same index in the same data center. It won't help when the data center or network is down. Where it would help, it relies on manual intervention and causes downtime. And because Vespa indexes in true real time and redistributes data dynamically to handle failures and to grow or shrink the cluster, it is hard to make disk snapshots work. If you have trouble setting up a multi-region deployment privately, you could do it with cloud.vespa.ai.

If you want to increase protection against catastrophic data loss, you could take periodic backups by visiting, or keep another offline copy of the data in some other system. Many want to do the latter in any case for data analysis and machine learning.
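As a rough illustration of a periodic backup by visiting, here is a minimal Python sketch that pages through the Document v1 HTTP API and writes one JSON document per line. The endpoint, namespace, document type, cluster name, and page size are placeholders, and the `backup` and `visit_url` helpers are hypothetical names for this example, not part of Vespa. It assumes the visit response carries a `documents` list and a `continuation` token until the last page.

```python
import json
import urllib.parse
import urllib.request


def visit_url(endpoint, namespace, doctype, cluster, continuation=None):
    """Build the Document v1 visit URL for one page of results."""
    # wantedDocumentCount is a hint; 500 is an arbitrary value for this sketch.
    params = {"cluster": cluster, "wantedDocumentCount": 500}
    if continuation:
        params["continuation"] = continuation
    path = f"/document/v1/{namespace}/{doctype}/docid"
    return f"{endpoint}{path}?{urllib.parse.urlencode(params)}"


def http_get_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)


def backup(endpoint, namespace, doctype, cluster, out, fetch=http_get_json):
    """Write every visited document to `out`, one JSON object per line.

    Returns the number of documents written. `fetch` is injectable so the
    paging logic can be exercised without a running Vespa instance.
    """
    count = 0
    continuation = None
    while True:
        page = fetch(visit_url(endpoint, namespace, doctype, cluster, continuation))
        for doc in page.get("documents", []):
            out.write(json.dumps(doc) + "\n")
            count += 1
        # Assumption: the token is absent once the visit has covered everything.
        continuation = page.get("continuation")
        if not continuation:
            return count
```

Called as `backup("http://localhost:8080", "mynamespace", "mydoctype", "mycluster", open("backup.jsonl", "w"))`, this would produce a file that can later be fed back into a fresh instance. Restoring still pays the indexing cost raised in the question.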

nehajatav · 2021-05-19T18:34:10Z

Does the statement below mean setting up two separate Vespa instances in two different DCs, or one Vespa instance with nodes spanning DCs?

"For high availability we recommend running a live Vespa instance in at least two regions/data centers."

bratseth · 2021-05-19T19:24:47Z

Two separate Vespa instances.

nehajatav · 2021-05-19T20:25:23Z

Gotcha, so keep calling 'visit' on the live instance and sync the delta to the non-live one?

bratseth · 2021-05-19T20:47:45Z

You can do that but it's not necessarily the most efficient way, and you'll be behind.
Other common options are to send writes to both instances in parallel from one origin, or to use a messaging system with persistent queues between the origin and the Vespa instances. The queue comes with a cost but makes it simpler to handle cases where an instance goes down.
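The first option, sending each write to both instances from one origin, could look roughly like the Python sketch below. It posts the same document to every endpoint through the Document v1 HTTP API and reports the outcome per endpoint. The endpoint URLs, namespace, and document type are placeholders, and `put_everywhere` is a hypothetical helper name, not a Vespa API.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_post(url, body):
    """POST a JSON body and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status


def put_everywhere(endpoints, namespace, doctype, doc_id, fields, send=http_post):
    """Send one put to every instance in parallel.

    Returns {endpoint: status code, or the exception raised for that endpoint}.
    `send` is injectable so the fan-out can be tested without a network.
    """
    quoted_id = urllib.parse.quote(str(doc_id), safe="")
    path = f"/document/v1/{namespace}/{doctype}/docid/{quoted_id}"
    body = json.dumps({"fields": fields}).encode("utf-8")

    def one(endpoint):
        # Capture failures per endpoint so one instance being down
        # does not hide the result from the other.
        try:
            return send(endpoint + path, body)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        return dict(zip(endpoints, pool.map(one, endpoints)))
```

The caller still has to decide what to do when one endpoint returns an exception, for example retrying or recording the write for later replay. That bookkeeping is what a persistent queue between the origin and the instances takes over.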

bratseth self-assigned this May 19, 2021

johans1 closed this as completed May 26, 2021
