docs: Add design document for Control Plane Ingress HA #3417

Merged · 1 commit · Jun 23, 2021
204 changes: 204 additions & 0 deletions docs/developer/architecture/control-plane-ingress.rst
@@ -0,0 +1,204 @@
Control Plane Ingress
=====================

Context
-------

Initially, the Control Plane Ingress Controller was deployed as a DaemonSet
behind a ClusterIP Service using the Bootstrap Control Plane IP as External
IP, and all Control Plane components (OIDC, the various UIs, ...) were then
configured to use this “External IP”.

This means those components can only be reached through the Bootstrap
Control Plane IP. If, for whatever reason, the Bootstrap node is temporarily
down, you can no longer access any UI and you have to manually reconfigure
every component to use another Control Plane node IP in order to restore
access to the Control Plane components; if the Bootstrap node is down
permanently, you have to restore it first.

Here, we want to solve this issue and make access to the Control Plane
components Highly Available, so that if you lose one node, including the
Bootstrap node, you can still access the various UIs.

NOTE: This document does not cover High Availability of the components
themselves, only of the access through the Ingress
(e.g. we do not want to solve salt-master HA here).

User Stories
------------

MetalK8s and Grafana UIs HA
~~~~~~~~~~~~~~~~~~~~~~~~~~~

I have a multi-node MetalK8s cluster with at least 3 Control Plane nodes;
if I lose one of the Control Plane nodes (including the Bootstrap one),
I can still access and authenticate on the MetalK8s and Grafana UIs.

Design Choices
--------------

To have a proper HA Control Plane Ingress, we want to use a
`Virtual IP <https://en.wikipedia.org/wiki/Virtual_IP_address>`_ managed by
`MetalLB <https://metallb.universe.tf/>`_, so that we can rely on Layer 2
ARP requests when possible.

However, in some networks this method cannot be used, so we also leave the
possibility of not using MetalLB and instead assigning an External IP,
provided by the user, that we expect to be a Virtual IP. We do not manage
this IP on our side: the user manages it, using whatever mechanism, to
switch it between Control Plane nodes.

To summarize, two different deployments are possible, depending on the user
environment:

- Using a VIP managed through Layer 2 ARP by MetalLB (built into MetalK8s)
- Using a user-provided IP that switches between Control Plane nodes
  (managed by the user)

NOTE: In both approaches, we want the user to provide the Control Plane
Ingress IP they want to use.

Rejected Design Choices
-----------------------

Manage Virtual IP by MetalK8s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of using MetalLB to manage the Virtual IP, we could have managed
this Virtual IP directly in MetalK8s with Keepalived (or any other “HA
tool”), so that we fully control the Virtual IP.

This approach was rejected because it does not seem to provide any real
advantage compared to using MetalLB directly, which manages this Virtual IP
for us and may provide other useful features in the future.

Rely on DNS resolution
~~~~~~~~~~~~~~~~~~~~~~

Instead of using a Virtual IP, we could rely on DNS resolution and use an
FQDN to configure the Control Plane components. In this case, we let the
user configure their DNS to resolve this FQDN to the IP of one Control
Plane node.

With this approach, we let the DNS server handle the “High Availability” of
the Control Plane UIs: if we lose the node the FQDN resolves to, we expect
the DNS server to switch to the IP of another, working Control Plane node.

This approach was rejected because it is not real HA: it expects the DNS
server to have some intelligence, which is likely not the case in most user
environments.

Use Nodes Control Plane IPs
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of using a Virtual IP or an FQDN, we could still rely on Control
Plane IPs and configure all Control Plane components using either relative
paths or the Control Plane IPs of all nodes, so that every Control Plane
component can be reached through any Control Plane IP.

With this approach, we do not need any specific infrastructure in the user
environment, but it means that every time we add or remove a Control Plane
node we need to reconfigure every Control Plane component that uses Control
Plane IPs.

This approach was rejected because the Control Plane components we deploy
today in MetalK8s do not all seem to support relative paths, and it does not
seem possible to have proper HA without reconfiguring some Control Plane
components when we lose a Control Plane node.

Non-Goals
---------

These points may be addressed later (or not), but in this document we focus
on a first, simple deployment that should fit most user environments.

- Manage Workload Plane Ingress HA
- Manage BGP routers

Implementation Details
----------------------

Control Plane Ingress IP
~~~~~~~~~~~~~~~~~~~~~~~~

In order to configure all Control Plane components, we need a single IP as
the Control Plane Ingress, so we expect the user to provide this Control
Plane Ingress IP in the Bootstrap configuration file.
For backward compatibility, this Ingress IP is only mandatory when MetalLB
is used, and defaults to the Bootstrap Control Plane IP otherwise
(so that we keep the same behavior as before).
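
For illustration, the relevant section of the Bootstrap configuration file
could look like the sketch below; the field names (ingressIP, metalLB)
follow the implementation PR #3418 quoted in the review discussion and may
differ in the final format:

.. code-block:: yaml

   # Hypothetical sketch of the Control Plane network section of the
   # Bootstrap configuration file (field names taken from #3418, may change)
   networks:
     controlPlane:
       cidr: <control-plane network cidr>
       ingressIP: <control-plane ingress ip>
       metalLB:
         enabled: true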

This Control Plane Ingress IP can be changed at any time by editing the
Bootstrap configuration file and following a simple documented procedure
that runs some Salt states to reconfigure every component that needs to be
reconfigured.

NOTE: Changing this Control Plane Ingress IP means we need to reconfigure
every Kubernetes APIServer, since we use this Ingress IP as the OIDC
provider.
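
As a minimal sketch (the issuer path and port below are assumptions, not
the final implementation), the kube-apiserver argument embedding this
Ingress IP could look like:

.. code-block:: yaml

   # Hypothetical kube-apiserver OIDC argument referencing the Ingress IP
   # (issuer path and port are assumptions)
   - --oidc-issuer-url=https://<control-plane ingress ip>:8443/oidc
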
Comment on lines +136 to +137

Contributor: this will change with oauth2-proxy I think, but we'll need to
reconfigure the proxy anyway 😇

Collaborator (Author): Likely yes, but since we do not have any design for
oauth2-proxy yet I do not mention it

MetalLB Configuration
~~~~~~~~~~~~~~~~~~~~~

MetalLB is not deployed in every environment, so it needs to be enabled
from the Bootstrap configuration file; that is why we add a new field for
MetalLB in the Control Plane network section.

Contributor: How is this done? What's the configuration format? How is it
backwards-compatible? How can it be made forward-compatible?

Collaborator (Author): You can check #3418. Basically, I have a metalLB key
in networks:controlPlane, and in this dict I have enabled (which defaults to
false) and an optional config field that gets merged with a default MetalLB
configuration:

    # MetalLB disabled by default
    networks_data["controlPlane"].setdefault("metalLB", {}).setdefault("enabled", False)
    if networks_data["controlPlane"]["metalLB"]["enabled"]:
        if not networks_data["controlPlane"].get("ingressIP"):
            errors.append(
                "'ingressIP' for 'controlPlane' network is mandatory when 'metalLB' "
                "is enabled"
            )
        else:
            address_pools = (
                networks_data["controlPlane"]["metalLB"]
                .setdefault("config", {})
                .setdefault("address-pools", [])
            )
            if not address_pools:
                address_pools.append({})
            address_pools[0].setdefault("name", "ingress-ip")
            address_pools[0].setdefault("protocol", "layer2")
            # Enforce address to the Ingress IP
            address_pools[0]["addresses"] = [
                "{}/32".format(networks_data["controlPlane"]["ingressIP"])
            ]
            address_pools[0]["auto-assign"] = False

Collaborator (Author): But as I answered in another comment below, if we do
not support BGP we do not need to expose config, and we can also hide
"metallb" from the Bootstrap config and documentation, and just put
manageVIP: true/false (or something else) in the bootstrap config

Today, we only allow MetalLB in Layer 2 mode, so we do not need to make the
MetalLB configuration configurable. If MetalLB is enabled in the Bootstrap
configuration, MetalK8s will deploy the following configuration for MetalLB:

.. code-block:: yaml

   address-pools:
   - name: control-plane-ingress-ip
     protocol: layer2
     addresses:
     - <control-plane ingress ip>/32
     auto-assign: false

As with the Control Plane Ingress IP, we can switch from non-MetalLB to
MetalLB (and the opposite) at any time, just by following the same
procedure.

Deployment
~~~~~~~~~~

As for every other addon in MetalK8s, we will use the MetalLB Helm chart
and render it using a specific “option” file. However, it will not always
be deployed: we only want to deploy it when a specific key is set in the
Bootstrap configuration file, and therefore in the Salt pillar.

When we use MetalLB, we do not want to keep the same NGINX Ingress
Controller deployment: since MetalLB will be the entry point into the
Kubernetes cluster, we do not need a DaemonSet running on every Control
Plane node; instead, we will use a Deployment with 2 replicas.
Comment on lines +171 to +173

Contributor: Won't this cause unnecessary mesh networking? I guess it's not
super critical for control plane, but I don't really see the benefit of
switching to a Deployment.

Collaborator (Author): It's less resources: here we stick with 2 replicas
for the full cluster (no matter the number of Control Plane nodes).
To me it's more like, "why should we use a DaemonSet when we can use a
Deployment?"
I think @NicolasT also had other arguments

Contributor: OK OK, I would just vote for lowering the priority of this
switch:

- it's an architecture change, so it has an impact on users willing to
  operate/monitor our stuff
- it's not a huge win (an idle Ingress controller doesn't take much
  resources)

We also need to configure the Service for the Ingress Controller
differently depending on whether we use MetalLB or not. When we use it, we
want a LoadBalancer Service, with the LoadBalancerIP set to the Ingress IP
provided by the user and externalTrafficPolicy set to Local. If we do not
use MetalLB, we use a ClusterIP Service with the Ingress IP provided by the
user as External IP (see the sketch below).

Contributor: If we use Local, why don't we set up the Ingress controller
pods to run locally to each node? What happens if no replica runs locally?

Collaborator (Author): Hmm, to me, when you use Local, the VIP will anyway
sit on a node that has some "pod" of the Service running locally, so it
cannot happen.

Contributor: Ah OK, I misunderstood this behaviour then.
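
As a hedged illustration of the two variants (the object name, namespace,
and ports below are assumptions, not the actual manifests):

.. code-block:: yaml

   # With MetalLB: a LoadBalancer Service pinned to the user-provided
   # Ingress IP (names and ports are assumptions)
   apiVersion: v1
   kind: Service
   metadata:
     name: ingress-nginx-control-plane-controller
     namespace: metalk8s-ingress
   spec:
     type: LoadBalancer
     loadBalancerIP: <control-plane ingress ip>
     externalTrafficPolicy: Local
     ports:
     - name: https
       port: 8443
       targetPort: https
     selector:
       app.kubernetes.io/name: ingress-nginx
   ---
   # Without MetalLB: a ClusterIP Service exposing the Ingress IP as
   # External IP
   apiVersion: v1
   kind: Service
   metadata:
     name: ingress-nginx-control-plane-controller
     namespace: metalk8s-ingress
   spec:
     type: ClusterIP
     externalIPs:
     - <control-plane ingress ip>
     ports:
     - name: https
       port: 8443
       targetPort: https
     selector:
       app.kubernetes.io/name: ingress-nginx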

This means the deployment of the NGINX Ingress Controller depends on some
Salt pillar values. Also, since we want to be able to switch between
MetalLB and non-MetalLB, we need to make sure the Salt states that deploy
the NGINX Ingress Controller remove no-longer-needed objects, as sketched
below (e.g. if you switch from non-MetalLB to MetalLB, you want to remove
the DaemonSet for the NGINX Ingress Controller).
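
As a minimal sketch, assuming a metalk8s_kubernetes.object_absent Salt
state and placeholder object names (both are assumptions), the cleanup when
switching from non-MetalLB to MetalLB could look like:

.. code-block:: yaml

   # Hypothetical Salt state removing the now-unneeded DaemonSet after
   # switching from non-MetalLB to MetalLB (state module and object names
   # are assumptions)
   Remove Control Plane Ingress Controller DaemonSet:
     metalk8s_kubernetes.object_absent:
       - apiVersion: apps/v1
       - kind: DaemonSet
       - name: ingress-nginx-control-plane-controller
       - namespace: metalk8s-ingress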

Documentation
-------------

- Describe all new Bootstrap configuration fields
- Add a simple procedure to change the Control Plane Ingress IP and
  reconfigure all Control Plane components that need it.

Test Plan
---------

Add some End-to-End tests in the CI:

- Use MetalLB and a VIP as Control Plane Ingress IP
- Test failover of MetalLB VIP
- Change Control Plane Ingress IP using documented procedure
- Switch from non-MetalLB to MetalLB using documented procedure
1 change: 1 addition & 0 deletions docs/developer/architecture/index.rst
@@ -10,6 +10,7 @@ Architecture Documents
centralized-cli
ci
configurations
control-plane-ingress
deployment
logs
metalk8s-ui