Implement HA mode #21

Open
yaroslav-gwit opened this issue May 17, 2023 · 1 comment
Assignees
Labels
feature request New feature request sponsored Someone has sponsored this feature, privately or publicly

Comments

Owner

yaroslav-gwit commented May 17, 2023

Some ideas for the HA mode implementation (subject to change):

  • Node self-monitoring using REST API and ha-watchdog process

  • Raft-like Consensus Algorithm, where no matter the cluster size, there are always 3 candidate nodes that control the whole cluster, and among these 3 candidate nodes there is 1 manager that makes failover decisions

  • All 3 candidate nodes must be specified manually, in the ha_config.json

  • All worker nodes are dynamically added to and removed from the cluster

  • In case of a node failure:

    • ha-watchdog process will reboot the node it's running on, which serves as a simple fencing mechanism
    • based on the failover strategy, the manager will start VMs from the failed node on other nodes in the cluster, prioritising nodes with the freshest VM snapshot
  • Notify cluster admins about the outage, including the list of VMs that were failed over and/or ignored

  • Keep a log of events over time, in plain text and JSON formats, for later presentation by the Hoster REST API and/or WebUI

The CLI flags to use (subject to change):

To start using HA, make sure you are running the latest dev release, then execute: hoster api start --ha-mode

--ha-mode - start the REST API server, and activate the HA mode
--ha-debug - only log actions, and do not actually perform them - useful for the initial cluster setup and troubleshooting
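One way to picture the difference between the production and debug modes is a small wrapper that always logs an action but only performs it when debug mode is off. The sketch below is an assumption about how such a switch might be structured; Action and execute are hypothetical names and are not taken from the Hoster code base.

```go
package main

import "fmt"

// Action is a hypothetical unit of work the HA manager wants to perform,
// for example starting a VM on another node.
type Action struct {
	Description string
	Run         func() error
}

// execute logs every action. In debug mode it stops after logging and never
// calls Run; otherwise it logs and then performs the action.
func execute(a Action, debug bool, logf func(format string, args ...any)) error {
	if debug {
		logf("[ha-debug] would run: %s", a.Description)
		return nil
	}
	logf("[ha] running: %s", a.Description)
	return a.Run()
}

func main() {
	started := false
	a := Action{
		Description: "start vm1 on node-c",
		Run:         func() error { started = true; return nil },
	}
	logf := func(format string, args ...any) { fmt.Printf(format+"\n", args...) }

	_ = execute(a, true, logf)
	fmt.Println("started after debug run:", started)
	_ = execute(a, false, logf)
	fmt.Println("started after production run:", started)
}
```

Routing every side effect through a single function like this keeps the debug log identical in shape to the production log, which is what makes a dry run useful during the initial cluster setup.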

@yaroslav-gwit yaroslav-gwit self-assigned this May 17, 2023
@yaroslav-gwit yaroslav-gwit added feature request New feature request new feature Label to apply to new features development labels May 17, 2023
@yaroslav-gwit yaroslav-gwit changed the title Create new ha_watchdog_service binary Create new hoster cluster command May 18, 2023
@yaroslav-gwit yaroslav-gwit changed the title Create new hoster cluster command Implement HA mode Sep 1, 2023
Owner Author

yaroslav-gwit commented Sep 16, 2023

Additional list of requirements:

  • vm migrate to automatically stop the VM, replicate it to a new endpoint, and start it there (can only be done on the candidate nodes)
  • running api stop on a worker node will trigger a graceful exit from the cluster for that node (which means it can be rebooted, or shut down for maintenance, without the cluster manager moving its VMs to other members)
  • implement SSL encryption to protect the communications between nodes (it will be optional, and mostly manual, because it will rely on the organisational standards that are already in place, e.g. manual vs automatic SSL certificate management, internal CAs, etc.)
  • create the documentation on how to use an internal CA to sign the HTTP certificates and use them in the api to activate the SSL encryption mode between HA nodes
  • restapi_config.json - Implement REST API config file to support multi-tenancy: restapi_config.json #58

Done:

  • api stop now automatically detects the HA mode: if the node where the command is executed belongs to the candidate group, it notifies the other cluster members to stop the HA mode; otherwise the command fails and asks the user to run it on one of the candidate nodes
  • api status to return production or debug HA mode within the status information
  • restapi_config.json has been implemented
  • vm migrate option has been dropped for now, because we will be migrating from SSH to the REST API for most functions, and it would slow down that migration

Still to do:

  • SSL/HTTPS integration/implementation
  • Docs for SSL/HTTPS
  • worker's clean disconnect

@yaroslav-gwit yaroslav-gwit added sponsored Someone has sponsored this feature, privately or publicly and removed in-progress new feature Label to apply to new features development labels Jan 7, 2024
Projects
Status: In Progress