High Availability (DR) #1540

Closed
ISECNOC opened this issue Sep 16, 2016 · 6 comments

@ISECNOC

ISECNOC commented Sep 16, 2016

I've been working in IT finance for a little more than 6 years now. We've been audited numerous times, and the questions are mostly the same.
Recently, let's say over the past 2 years, there is a new recurring question: Do you have a DR plan and how often do you test it? How long will it take to restore your infrastructure?

So I started thinking about this and decided that replicating the SAN, together with the built-in Disaster Recovery feature in XenServer, is probably the way to go.
This was the main goal until I started using XOA and realized that XOA could probably do everything for me. There is already a DR-Copy feature in the Backup section, which does most of what I want.

I am thinking of a new feature called "High Availability (DR)" which would work this way:

The scenario is the following: you have site A and site B, Pool A and Pool B, and SR A and SR B, where SR A is connected to Pool A and SR B to Pool B.

  1. You set up a "High Availability (DR)" job exactly like you set up backup jobs (similar functionality to a DR-Copy job) to copy "important" VMs from Site A, Pool A, SR A to Site B, Pool B, SR B.
    The schedule can be anything from once per week or once per day to once per hour, if you have enough resources and bandwidth available.

  2. XOA then monitors the source VM and source pool. If there is a normal failure, HA will automatically try to restart the VM in the same pool. But what if the whole pool goes down, or even the site?

This is where it becomes interesting: in my world XOA would try to reach the remote pool, and if that fails it should automatically start the copied VMs on Site B, Pool B from SR B, or maybe trigger some kind of alert that lets the user choose whether the DR should be initiated.

I'm not sure if you follow me, but I hope you do so far.
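
To make the idea a bit more concrete, here is a rough Python sketch of the decision logic described above. None of this is an existing XOA API: `DrJob`, `Action`, `decide_action` and the `auto_failover` flag are hypothetical names made up for illustration, and the reachability checks are passed in as plain booleans. It also assumes that "the remote pool" means the source pool (Pool A) as seen from XOA.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    NONE = "none"          # source pool is fine, nothing to do
    LOCAL_HA = "local_ha"  # normal failure: leave it to the pool's own HA
    ALERT = "alert"        # ask the user whether DR should be initiated
    FAILOVER = "failover"  # start the copied VMs on Site B


@dataclass
class DrJob:
    """Hypothetical job definition; not an existing XOA object."""
    vm_names: List[str]
    source_pool: str
    target_pool: str
    target_sr: str
    schedule: str = "daily"      # e.g. "weekly", "daily", "hourly"
    auto_failover: bool = False  # False: alert the user instead of failing over


def decide_action(job, source_pool_reachable, target_pool_reachable, all_vms_running):
    """Pick what XOA should do for one monitoring pass of a DR job."""
    if source_pool_reachable:
        # A VM failure inside a healthy pool is the local HA's business.
        return Action.NONE if all_vms_running else Action.LOCAL_HA
    if not target_pool_reachable:
        # Nowhere to fail over to, so all we can do is tell someone.
        return Action.ALERT
    return Action.FAILOVER if job.auto_failover else Action.ALERT
```

With `auto_failover` left at `False`, losing Pool A only raises an alert, which matches the "let the user choose" variant. A real implementation would probably also want several failed checks in a row before deciding the pool is gone.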

Now think of a scenario where you want to test this, how would it be done?
Well in my world it should be pretty simple, this is how I am thinking:

  1. You inform everyone about a failover-simulation

  2. You go into this job and click "Test failover". It prompts you again, informing you that there will be an interruption on the production site where this test is performed. It also informs you that there needs to be free RAM, CPU and so on at the destination pool.

  3. HA is disabled at the source pool.

  4. The VMs in the source pool will be paused/suspended

  5. The copied VMs will be started. This is when you can try to reach an application or similar hosted on such a VM to really confirm that it is running properly.

  6. There is a predefined time, let's say 10 minutes by default, after which the copied VMs will be halted.

  7. The VMs at the source pool are now unpaused so that normal production can continue. This is done only when either it is automatically confirmed that the VMs on the DR site have been halted, or the user "overrides" any failures and tells XOA to move on to the next step.

  8. HA is activated at the source pool.

  9. A report is generated with status, time and similar information that PwC and other auditors want to look at.
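
The test-failover steps above could be sketched roughly like this in Python. It is only an illustration of the ordering and of the safety check in step 7. `FakeBackend` and `run_test_failover` are hypothetical names, the backend just records calls instead of talking to a pool, and the report is a plain list of (step, status) pairs without timestamps.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FakeBackend:
    """Stand-in for real pool operations; records calls instead of doing them."""
    halt_succeeds: bool = True
    calls: list = field(default_factory=list)

    def do(self, step):
        self.calls.append(step)

    def copies_halted(self):
        return self.halt_succeeds


def run_test_failover(backend, window_seconds=600, override=False, sleep=time.sleep):
    """Run steps 3-8 of the test and return a report for step 9."""
    report = []

    def step(name):
        backend.do(name)
        report.append((name, "ok"))

    step("disable HA on source pool")
    step("suspend source VMs")
    step("start copied VMs on DR pool")
    sleep(window_seconds)  # window for checking the applications by hand
    step("halt copied VMs on DR pool")
    # Never resume production while DR copies may still be running,
    # unless the user explicitly overrides the failed check.
    if not backend.copies_halted() and not override:
        report.append(("resume source VMs", "blocked: DR copies not confirmed halted"))
        return report
    step("resume source VMs")
    step("enable HA on source pool")
    return report
```

The default window of 600 seconds is the 10 minutes from step 6. Passing `sleep=lambda s: None` skips the wait, which is handy for a dry run of the sequence itself.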

If there are any questions, please do not hesitate to contact me!
On IRC or by email: nikade@freenode or niklas.ahden@isec.com

@olivierlambert olivierlambert added this to the Long term milestone Sep 16, 2016
@fufroma
Contributor

fufroma commented Sep 16, 2016

Link to Pool Replication, for future me reading this.

@ISECNOC
Author

ISECNOC commented Sep 16, 2016

I know the Pool Replication feature, but it requires me to have the same storage at both sites.
If XOA handles it, I can have any type of destination SR, since XOA doesn't care as long as the SR is reachable by one of the pools.

Scenario:
Site A, Pool A, SR A is hosted by a NexentaStor machine with a license and all those fancy things.

Site B, Pool B, SR B is hosted by a Dell EqualLogic SAN or ANY NFS/iSCSI-capable storage appliance.

As long as XOA sees SR B on Pool B, it will be able to copy the VM.
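
As a small illustration of that point, picking a destination could look something like the Python sketch below. `Sr` and `eligible_target_srs` are hypothetical names, not XOA objects, and the free-space check is my own assumption on top of "XOA sees SR B on Pool B". The storage type is deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class Sr:
    """Hypothetical view of a storage repository as XOA might see it."""
    name: str
    sr_type: str     # e.g. "nfs" or "iscsi"; not used for eligibility
    pool: str
    free_bytes: int


def eligible_target_srs(srs, target_pool, vm_size_bytes):
    """Return every SR attached to the target pool with room for the VM."""
    return [
        sr for sr in srs
        if sr.pool == target_pool and sr.free_bytes >= vm_size_bytes
    ]
```

So a NexentaStor-backed SR on Pool A and an NFS or iSCSI SR on Pool B are treated the same way. Only where the SR is attached and how much space it has would matter.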

@olivierlambert
Member

Idea added to the wiki

@olivierlambert
Member

@ISECNOC how do we know when a VM should be considered down? Just by its power state?

@nikade87

@olivierlambert Yeah, that is a good idea. Since XOA would handle both Pool A and Pool B, I don't see why that wouldn't work.

@nikade87

I noticed that I am logged in with my personal GitHub account, but the answer is the same :-)
