Disaster recovery in Vespa #17898

Closed
nehajatav opened this issue May 19, 2021 · 5 comments

@nehajatav

Hi Vespa team,
What are the options for disaster recovery in Vespa? I have read about 'visit', but my understanding is that it produces a backup in the form of JSON. The problem is that bringing up a Vespa cluster quickly from that JSON incurs the time overhead of re-indexing the documents. Do you recommend taking periodic disk index snapshots for disaster recovery? Is there a way to pause writes, force a flush to disk on all content nodes, and take a snapshot of the disk using something like rsync? We currently use rsync-based backup for a Lucene-based search engine, and the differential rsync for the delta period is pretty quick. That relies on the assumption that the index contents on disk change incrementally rather than massively. How does Vespa Cloud manage disaster recovery if an entire datacenter gets wiped out?

@bratseth
Member

For high availability we recommend running a live Vespa instance in at least two regions/data centers.
There aren't many scenarios where it helps to be able to quickly start with the same index in the same data center. It won't help when the data center or network is down, and when it would help, it relies on manual intervention and causes downtime. It is also hard to make work, because Vespa indexes in true real time and redistributes data dynamically to handle failures and to grow or shrink the cluster automatically. If you have trouble setting up a multi-region deployment privately, you could do it with cloud.vespa.ai.

If you want to increase protection against catastrophic data loss, you could take periodic backups by visiting, or keep another offline copy of the data in some other system. Many want to do the latter in any case for data analysis and machine learning.
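
As a rough illustration of a periodic backup by visiting, the sketch below pages through documents and writes them to a JSON Lines file. It assumes the Document v1 HTTP API returns a `documents` array plus a `continuation` token while more data remains; the endpoint, namespace, document type, cluster name, and page size are all placeholders, and the request parameters should be checked against the Vespa documentation before use.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional

Page = dict  # one parsed JSON response from a visit request


def http_fetcher(endpoint: str, namespace: str, doctype: str,
                 cluster: str) -> Callable[[Optional[str]], Page]:
    """Build a page fetcher; every URL part here is an assumption."""
    base = f"{endpoint}/document/v1/{namespace}/{doctype}/docid"

    def fetch(continuation: Optional[str]) -> Page:
        params = {"cluster": cluster, "wantedDocumentCount": "500"}
        if continuation:
            params["continuation"] = continuation
        url = base + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    return fetch


def visit_all(fetch: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield every document, following continuation tokens until none is left."""
    continuation = None
    while True:
        page = fetch(continuation)
        yield from page.get("documents", [])
        continuation = page.get("continuation")
        if not continuation:
            return


def backup_to_jsonl(fetch: Callable[[Optional[str]], Page], path: str) -> int:
    """Write one JSON document per line and return the number written."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for doc in visit_all(fetch):
            out.write(json.dumps(doc) + "\n")
            count += 1
    return count
```

A scheduled job could call `backup_to_jsonl(http_fetcher("http://localhost:8080", "mynamespace", "mydoctype", "mycluster"), "backup.jsonl")`, where all four arguments to `http_fetcher` are hypothetical. Restoring from such a file still means feeding and re-indexing every document, which is the overhead raised in the question.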

@bratseth self-assigned this May 19, 2021
@nehajatav
Author

Does the statement below mean setting up two separate Vespa instances in two different DCs, or a single Vespa instance with nodes spanning DCs?

> For high availability we recommend running a live Vespa instance in at least two regions/data centers.

@bratseth
Member

Two separate Vespa instances.

@nehajatav
Author

Gotcha, so keep calling 'visit' on the live instance and sync the delta to the non-live one?

@bratseth
Member

You can do that, but it's not necessarily the most efficient way, and the second instance will lag behind the live one.
Other common options are to send writes to both instances in parallel from one origin, or to put a messaging system with persistent queues between the origin and the Vespa instances. The latter comes at a cost but makes it simpler to handle cases where an instance goes down.
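
To make the queue-based option concrete, here is a minimal sketch of fanning each write out to one queue per Vespa instance, so that an instance that is down catches up in order once it returns. The `QueuedFanout` class and the sender callables are hypothetical, and the in-memory `deque` only stands in for a real persistent queue in a messaging system; nothing here is a Vespa API.

```python
from collections import deque
from typing import Callable, Deque, Dict

Write = dict
Sender = Callable[[Write], None]  # hypothetical: feeds one instance, raises on failure


class QueuedFanout:
    """Keep one FIFO queue per instance and deliver writes in order."""

    def __init__(self, senders: Dict[str, Sender]) -> None:
        self.senders = senders
        # In-memory stand-in for a persistent queue per instance.
        self.queues: Dict[str, Deque[Write]] = {name: deque() for name in senders}

    def submit(self, write: Write) -> None:
        """Enqueue the write for every instance, then try to deliver."""
        for queue in self.queues.values():
            queue.append(write)
        self.drain()

    def drain(self) -> None:
        """Deliver queued writes; stop at the first failure for an instance."""
        for name, queue in self.queues.items():
            while queue:
                try:
                    self.senders[name](queue[0])
                except Exception:
                    break  # instance unavailable: keep the write and retry later
                queue.popleft()
```

Because a write is removed from a queue only after its sender succeeds, a failed instance simply accumulates a backlog while the other keeps receiving writes, and a later `drain()` replays the backlog in the original order. A sender that fails after the instance has already applied the write would cause a redelivery, so this sketch assumes writes are safe to apply more than once.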

@johans1 closed this as completed May 26, 2021