Replication Manager on Kubernetes #298

Closed · tafkam opened this issue Apr 17, 2020 · 32 comments

@tafkam commented Apr 17, 2020

Is it possible to use environment variables in the replication-manager main configuration? I have found references to such a feature for the provisioning agent, like {env.nodes}, in the documentation.
Using environment variables would help when running replication-manager on Kubernetes, by referencing MariaDB passwords from managed databases with "env.valueFrom.secretKeyRef".
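
For context, this is the Kubernetes mechanism I mean; a minimal sketch, where the secret name mariadb-credentials, its key, and the variable name are placeholders from my setup:

# Pod spec fragment: inject a MariaDB password from an existing
# Kubernetes secret into the container environment (it arrives as
# plaintext at runtime).
containers:
  - name: replication-manager
    image: signal18/replication-manager:2.1
    env:
      - name: MYSQL_ROOT_PASSWORD
        valueFrom:
          secretKeyRef:
            name: mariadb-credentials
            key: mariadb-root-password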

On a side note: since replication-manager already has lots of the functionality needed for Kubernetes operators (provisioning, managing, monitoring, failover, written in Go, etc.), and waaayy more features and options for high-availability databases than the MariaDB/MySQL operators I've tried, you should consider going forward with a Kubernetes operator edition of replication-manager.

@svaroqui (Collaborator)

Hey, thanks tafkam,

If you are talking about the config variables of replication-manager, I need to dig into it, but we already have a password encryption feature: the key can be extracted from a K8s secret map and put into the repman pod (docker, podman) to enable password decryption. In OpenSVC we can store a map value to a file via shm; I guess this is possible in K8s too? If not, I need to look into it.

./replication-manager-pro --config=etc/local/features/backup/config.toml.sample monitor --help | wc -l
382

That's a lot of ENV variables :(

@tafkam (Author) commented Apr 17, 2020

If you are referring to the AES encryption described in https://docs.signal18.io/configuration/security, this won't help with Kubernetes integration. Kubernetes secrets are base64 encoded and can be injected into the runtime environment of the Docker image, or mounted as a file into the Docker image, in both cases as plaintext. For replication-manager to use existing Kubernetes MariaDB password secrets, they need to be read from the environment (or from a defined file location where the secret is mounted). Using the environment is preferred, since the secret files generated by different Helm charts, operators, boilerplate manifests etc. differ in key:value formats; 'user: password' is seldom used here.

The number of options replication-manager has is beside the point: this is about replacing defined environment-variable references in the configuration file with actual values from the environment. If you see {env.root_password} in the config file, it just needs to be replaced with the value returned by getEnv('root_password').

@svaroqui (Collaborator)

Yes, I understand your point; I'll keep it in mind as a feature request.

svaroqui self-assigned this Apr 17, 2020
@svaroqui (Collaborator)

Can we plan a conf call next week on Skype (svaroqui) or Zoom to better define what can be done here?

If I get you correctly:

Feature 1: on startup, repman replaces the default config value of every password-related option with the equivalent env variable defined in the container.

Feature 2: repman provisions a K8s secret map for all service password variables, and every service provisioned later refers to the key map instead of a plain password.

@tafkam (Author) commented Apr 17, 2020

I don't know what use case Feature 2 would be good for; I didn't ask for something like that ;-)

What I would like to have is Feature 1 for any configuration option, not just passwords. For example, you could set FAILOVER_MODE="automatic" as an environment variable and use it as failover-mode={env.FAILOVER_MODE}. Or you set REPLICATION_USER and REPLICATION_PASSWORD as environment variables and can do something like replication-credential = "{env.REPLICATION_USER}:{env.REPLICATION_PASSWORD}".
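
Concretely, the config fragment I have in mind would look like this (a sketch; the {env.*} substitution is the proposed feature, not something replication-manager supports today):

# Environment defined in the container:
#   FAILOVER_MODE=automatic
#   REPLICATION_USER=repl
#   REPLICATION_PASSWORD=secret

failover-mode = "{env.FAILOVER_MODE}"
replication-credential = "{env.REPLICATION_USER}:{env.REPLICATION_PASSWORD}"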

Anyways, I bumped into another issue which is kind of a dealbreaker for using replication-manager properly on Kubernetes. It seems replication-manager caches the DNS lookup for a server indefinitely and stops looking up IPs if it encounters a resolve error. When using a MariaDB statefulset with a headless service and a db instance crashes/is restarted/updated, the new pod will also have a new IP address. replication-manager will never see the up-and-running db instance, since it doesn't do any more DNS lookups.

There are probably a few more issues like those in the Kubernetes context, which would be a bunch of work to figure out for every possible supported replication-manager configuration. Maybe the replication-manager core developers should first figure out whether they want to go down the Kubernetes route. Looking at the whole lot of features replication-manager provides in the classical server world, there would be lots of changes and additions needed to provide the same functionality on Kubernetes. I guess at that point a fork and rewrite would be easier than having a "can do everything everywhere" code-base ;-) If you want to open the can of worms that Kubernetes is, here is a good starting point: https://github.com/operator-framework/operator-sdk

@svaroqui (Collaborator) commented Apr 18, 2020 via email

@tafkam (Author) commented Apr 18, 2020

I'm using db-servers-hosts = "sts-mariadb-0.sts-mariadb:3306,sts-mariadb-1.sts-mariadb:3306" as the cluster configuration. When I start both statefulset pods and then deploy replication-manager, it sees both instances as running (no replication started yet; "out-of-the-box" MariaDB Docker images with a server configuration for master/slave replication).
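
For reference, the per-pod DNS names above come from a headless Service in front of the StatefulSet, roughly like this (a minimal sketch; names match my setup):

# Headless Service (clusterIP: None): each StatefulSet pod gets a stable
# DNS name like sts-mariadb-0.sts-mariadb.<namespace>.svc.cluster.local.
# The StatefulSet must reference it via serviceName: sts-mariadb.
apiVersion: v1
kind: Service
metadata:
  name: sts-mariadb
spec:
  clusterIP: None
  selector:
    app: sts-mariadb
  ports:
    - name: mysql
      port: 3306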

Then I started replication-manager while the sts-mariadb-1 instance wasn't up yet, and got this error:

2020/04/17 18:41:51 | STATE | RESOLV ERR00062 : DNS resolution for host sts-mariadb-1.sts-mariadb error lookup sts-mariadb-1.sts-mariadb on 169.254.20.10:53: server misbehaving
2020/04/17 18:41:58 | INFO | Declaring slave db sts-mariadb-1.sts-mariadb:3306 as failed
2020/04/17 18:41:58 | ALERT | Server sts-mariadb-1.sts-mariadb:3306 state changed from Suspect to Failed

since the pod IP can't be resolved via the cluster DNS yet. When the sts-mariadb-1 instance finished loading and the domain resolved, the replication-manager instance status didn't change.

sts-mariadb-0 was up and running at that time, so I killed that pod for testing:

2020/04/17 18:42:24 | INFO | Declaring slave db sts-mariadb-0.sts-mariadb:3306 as failed
2020/04/17 18:42:24 | ALERT | Server sts-mariadb-0.sts-mariadb:3306 state changed from Suspect to Failed

sts-mariadb-0 got a new pod IP, and the headless service domain sts-mariadb-0.sts-mariadb changed accordingly.

Both server pods were restarted, are running in Kubernetes, and can be connected to via the headless service domain. Both are still marked as failed in replication-manager. The DNS resolve error for sts-mariadb-1 is just an indication to me that the DNS lookup isn't being repeated.
Maybe the DNS resolver works differently for the OpenSVC functionality?

@svaroqui (Collaborator)

What release are you using, the 2.1 Docker image?

@svaroqui (Collaborator) commented Apr 18, 2020

Anyway, you should not use the Docker container hostname; that will never work like this. You should use the DNS of the orchestrator. This is 99% sure a K8s config issue regarding DNS, where DNS changes are not propagated to container hostnames; your service name should look like my-svc.my-namespace.svc.cluster.local.

@svaroqui (Collaborator)

When repman deploys the database and proxy, you have the possibility to set the cluster part, but the namespace is hardcoded to the cluster name.

@tafkam (Author) commented Apr 18, 2020

I'm using the latest replication-manager (2.1) release from Docker Hub. There is no container "hostname" in Kubernetes... I'm using the headless service DNS, which expands to sts-mariadb-0.sts-mariadb.namespace.svc.cluster.local.

Also, I can assure you there is no DNS issue in my Kubernetes cluster; I'm running dozens of different services, all depending on working DNS.

@svaroqui (Collaborator) commented Apr 18, 2020

Hmm, interesting. I'll investigate, but quick googling for DNS cache/resolve issues on the go-mysql driver doesn't turn up any issues or configuration options. I would agree with you if the replication-manager code itself did reverse DNS and stored the result in a local variable, but we stopped doing that a long time ago.

@svaroqui (Collaborator)

Can you try the Docker pro release, the one I use for testing orchestrators? I think there are different net implementations selected at compile time.

@tafkam (Author) commented Apr 18, 2020

Besides some errors about OpenSVC, the pro release resolves new pod IPs and the instance leaves the failed state. I have run into another issue not related to replication-manager, so I can't test the replication bootstrap and failover functionality yet.
I will get back to you when I can test more of replication-manager. I'm changing the issue title, since this is getting a bit out of scope here ;-)

tafkam changed the title from "Extend configuration with environment vars" to "Replication Manager on Kubernetes" Apr 18, 2020
@svaroqui (Collaborator)

Please feel free to ping us with any feature update or request; maybe we can get in contact for a talk to explain how we are moving forward with the product and why!

@svaroqui (Collaborator)

The majority of our sponsors use replication-manager on premise for backups, monitoring and HA, but others mostly use the API for bootstrapping replication, switchover, or triggering rolling restarts on config changes (SlapOS orchestrator), and others use it fully integrated with OpenSVC for cluster deployment. Are you already using init containers for the database and proxy containers?

@tafkam (Author) commented Apr 18, 2020

Basically, I would just like to set up several MariaDB master/slave replication clusters on Kubernetes with MaxScale (or whatever solution) for read/write splitting and master auto-failover and rejoins. For MaxScale to work its failover magic I need GTID replication, which none of the existing Kubernetes operators or Docker images I've found supports, and I don't want to initially set up GTID replication manually for new MariaDB clusters. I would like to avoid Galera, since I've had bad experiences with cluster-wide locking, and recovery is exhausting, even more so in the containerized world.
And then I found replication-manager ;-) So I'm mainly interested in the (semi-)automatic replication bootstrapping, failure detection, failover and cluster recovery functionality. Rolling restarts and updates are handled quite well by Kubernetes, and I already have a backup solution (AppsCode Stash). Monitoring is done with a Prometheus exporter sidecar.
Since replication-manager's auto-failover and recovery features could replace MaxScale in that regard, I plan to evaluate ProxySQL later, if everything works on the replication side.
That said, I try to use the standard functionality of the available Docker containers and to not build my own Docker image. Bitnami's MariaDB init scripts were interfering with external replication changes, so I'm back to the official MariaDB Docker image, for which I now have to find an elegant way to set the server-id for each statefulset instance in the my.cnf config file, since I don't want to create several deployments and configmaps for master/slaves in Kubernetes (one idea is sketched below).
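
One pattern I'm considering for the server-id problem, a minimal sketch along the lines of the upstream Kubernetes MySQL StatefulSet example (the volume name and paths are placeholders):

# Pod spec fragment: an init container derives a unique server-id from
# the StatefulSet pod ordinal (sts-mariadb-0 -> 100, sts-mariadb-1 -> 101)
# and writes a conf.d drop-in; the "conf" emptyDir is also mounted by the
# mariadb container at /etc/mysql/conf.d so mysqld picks it up at startup.
initContainers:
  - name: init-server-id
    image: busybox:1.31
    command:
      - sh
      - -c
      - |
        ordinal=$(hostname | awk -F- '{print $NF}')
        printf '[mysqld]\nserver-id=%d\n' $((100 + ordinal)) > /mnt/conf.d/server-id.cnf
    volumeMounts:
      - name: conf
        mountPath: /mnt/conf.d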

@tanji (Collaborator) commented Apr 19, 2020

Hi,

the Pro version uses the DNS resolver from the operating system (libc binding on Linux);
the Std version uses a pure Go resolver.

If you find that you have resolution issues with the Std version, try this before running it:

export GODEBUG=netdns=cgo

It will force it to run with the C binding. I'm interested to know if that solves any issues, in which case we can change our compile-time settings.
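
On Kubernetes the equivalent would be an env entry in the container spec; a sketch (note that netdns=cgo only takes effect if the binary was built with cgo enabled):

env:
  - name: GODEBUG
    value: netdns=cgo   # force the libc resolver; a pure-Go (CGO_ENABLED=0) build ignores this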

@svaroqui (Collaborator) commented Apr 19, 2020 via email

@tafkam (Author) commented Apr 21, 2020

> the Pro version uses the DNS resolver from the operating system (libc binding on Linux); the Std version uses a pure Go resolver.
>
> If you find that you have resolution issues with the Std version, try this before running it:
>
> export GODEBUG=netdns=cgo
>
> It will force it to run with the C binding. I'm interested to know if that solves any issues, in which case we can change our compile-time settings.

Sadly, I can't reproduce the error situation anymore. The standard replication-manager and the pro replication-manager now behave the same way in that regard. Have you already changed your compile-time settings for signal18/replication-manager:2.1? I'm always pulling the latest image for tests.

@tafkam (Author) commented Apr 21, 2020

After a bit more fiddling, the managed replication works now! There are some irritating logs, though:

time="2020-04-21T20:17:41Z" level=info msg="Enforce GTID replication on slave sts-mariadb-1.sts-mariadb:3306" cluster=sts
time="2020-04-21T20:17:44Z" level=warning msg="Cluster state down" cluster=sts code=ERR00021 status=RESOLV type=state
time="2020-04-21T20:17:44Z" level=warning msg="Could not find a slave in topology" cluster=sts code=ERR00010 status=RESOLV type=state
time="2020-04-21T20:17:44Z" level=warning msg="Could not find a master in topology" cluster=sts code=ERR00012 status=RESOLV type=state
time="2020-04-21T20:17:44Z" level=warning msg="Monitor freeze while running critical section" cluster=sts code=ERR00001 status=RESOLV type=state

Still working on the MaxScale/ProxySQL part to get the full cluster working, but the replication-manager part works so far. Thanks for all the help and hints!

Here are some general suggestions/feature requests which I would have found useful on my journey ;-):

  • the aforementioned "templating" of the config file with arbitrary Docker shell environment vars (like what the "substitute_variables" option does in MaxScale)
  • make the base-href path for the dashboard configurable, so you can easily run replication-manager behind a reverse proxy on a non-root path (e.g. https://some.clustermanagement.site/replication-manager/)
  • a function to create the defined replication user/password on the database servers, since the db root/admin credentials have to be provided anyway, I guess
  • an option to automatically replication-bootstrap a cluster if all servers are detected as "standalone" (I'm not 100% sure if this is a safe precondition for replication bootstrapping)
  • the dashboard seems to have some memory issues and crashes if it runs for some time in the background (Chrome-based browser)
  • some example manifests for database setup (I can provide mine for MariaDB master/slave GTID if you want them)

more advanced features which would be possible on Kubernetes:

  • using custom resource definitions to define cluster configuration blocks, which are then aggregated into the whole replication-manager config
  • using custom resource definitions to provision database clusters and their replication-manager configuration blocks in Kubernetes

Thanks again for the kind help!

@svaroqui (Collaborator)

I pushed the env variables setting.
Can you try it if you get time?

@tafkam (Author) commented Apr 23, 2020

Sure, what is the expected variable format? $VAR?

@svaroqui (Collaborator) commented Apr 23, 2020

Yes, the upper-case equivalent of the config variable, with s/-/_/g.

@tafkam (Author) commented Apr 23, 2020

I tried db-servers-hosts = "$SERVER1,$SERVER2" first, but replication-manager used the literal string value.
Then, looking at your commit changes, the environment loading seems to work only in the default section (SetEnvPrefix("DEFAULT"))? Setting the environment variable FAILOVER=automatic with failover-mode = "$FAILOVER" in the config did indeed work.
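
For what it's worth, here is a minimal sketch of how that kind of viper-style env binding typically behaves (assuming spf13/viper, as SetEnvPrefix("DEFAULT") suggests; this is an illustration, not replication-manager's actual code):

// Sketch: env vars override config keys via a DEFAULT_ prefix
// and a dash-to-underscore key replacer.
package main

import (
	"fmt"
	"strings"

	"github.com/spf13/viper"
)

func main() {
	v := viper.New()
	v.SetEnvPrefix("DEFAULT")                          // env vars are matched as DEFAULT_<KEY>
	v.SetEnvKeyReplacer(strings.NewReplacer("-", "_")) // failover-mode -> FAILOVER_MODE
	v.AutomaticEnv()                                   // consult the environment on every lookup

	// With DEFAULT_FAILOVER_MODE=automatic exported in the container,
	// this prints "automatic" regardless of the config file value.
	fmt.Println(v.GetString("failover-mode"))
}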

@svaroqui (Collaborator)

Yes, only literals. I can maybe do something for db-servers-hosts; I already have a wrapper for IPv6 on this variable.

@tafkam (Author) commented Apr 24, 2020

Wouldn't it be easier and more flexible to replace all ${arbitrary_env_var} occurrences in the config file with os.Getenv(arbitrary_env_var) on load, like in https://github.com/signal18/replication-manager/blob/2.1/utils/misc/env_vars.go?
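
A minimal sketch of that kind of templating: expand ${VAR} (and $VAR) references in the raw config text before the TOML is parsed (the helper name expandEnv is hypothetical, not replication-manager's API):

package main

import (
	"fmt"
	"os"
)

// expandEnv replaces each $VAR / ${VAR} reference in the raw config
// text using os.Getenv; unset variables expand to the empty string.
func expandEnv(rawConfig string) string {
	return os.Expand(rawConfig, os.Getenv)
}

func main() {
	os.Setenv("REPLICATION_USER", "repl")
	os.Setenv("REPLICATION_PASSWORD", "secret")

	raw := `replication-credential = "${REPLICATION_USER}:${REPLICATION_PASSWORD}"`
	fmt.Println(expandEnv(raw))
	// prints: replication-credential = "repl:secret"
}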

@svaroqui (Collaborator)

Yes, possibly; I'm going to do this for templating.

@tafkam (Author) commented Apr 24, 2020

I'm testing a master-master replication setup. When both MariaDB instances are started, replication-manager acknowledges the multi-master setup of the cluster, but no GTID replication is set up yet and both DBs are standalone.
In the dashboard under Cluster/Replication Bootstrap there is no multi-master replication option. To get GTID replication running I have to bootstrap Master-Slave Positional. The cluster is then of type master-slave, and the slave has GTID replication configured. After doing a switchover, the former master also gets GTID replication configured. After that I do a Multi Master bootstrap, but nothing happens and the cluster stays in master-slave mode.
When I restart the replication-manager pod after all that, the multi-master setup is recognized by replication-manager and the multi-master GTID replication works properly.

How should multi-master setups be bootstrapped properly? This seems unintuitive/buggy to me.

@svaroqui (Collaborator) commented Apr 25, 2020 via email

@tafkam (Author) commented Apr 25, 2020

The GUI exposes the option to bootstrap multi-master, but not when the cluster is detected as multi-master ;-) so that's probably the bug.

@svaroqui (Collaborator) commented Apr 26, 2020 via email

tanji closed this as completed Jun 30, 2022