198 changes: 27 additions & 171 deletions modules/contributor/pages/adr/ADR029-database-connection.adoc
@@ -17,6 +17,8 @@ Technical Story: https://github.com/stackabletech/issues/issues/238

NOTE: We might want to incorporate changes to address https://github.com/stackabletech/issues/issues/681, maybe as V2?

NOTE: Parts of this document might be out of date. The source of truth is in https://github.com/stackabletech/operator-rs/tree/main/crates/stackable-operator/src/database_connections[the finished implementation in operator-rs]

== Context and Problem Statement

Many products supported by the Stackable Data Platform require databases to store metadata. Currently there is no uniform, consistent way to define database connections. In addition, some Stackable operators define database credentials to be provided inline and in plain text in the cluster definitions.
@@ -179,16 +181,14 @@ NOTE: This proposal was rejected for the same reason as the first proposal.

=== (accepted) Product-supported and generic DB specifications

It seems that a unique, platform-wide mechanism to describe database connections that also fulfills all acceptance criteria is not feasible. Database drivers and product configurations are too diverse and cannot be forced into a type-safe specification.

Thus the single, global connection manifest needs to be split into two different categories, each covering a subset of the acceptance criteria:

1. A database-specific mechanism. This catches misconfigurations early and promotes good documentation and uniformity inside the platform.
2. An operator-specific mechanism. This is a wildcard that can be used to configure database connections that are not officially supported by the products but that can still be partially validated early.

The first mechanism requires the operator framework to provide predefined structures and supporting functions for widely available database systems such as PostgreSQL, MySQL, MariaDB, Oracle, SQLite, Derby, Redis and so on. This doesn't mean that all products can be configured with all DB implementations: the product definitions only allow the subset that is officially supported by each product. For that, every product operator defines a complex enum of exactly the databases it supports.

The second mechanism is operator/product specific and consists mostly of a pass-through list of relevant **product properties**. There is at least one exception: the handling of user credentials, which still need to be provisioned in a secure way (as long as the product supports it).
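The per-product enum mentioned above could look roughly like this. This is an illustrative Rust sketch with hypothetical type, variant, and function names; the real definitions live in operator-rs.

```rust
// Illustrative sketch only: names are hypothetical, not the actual
// operator-rs API. Each product operator defines an enum listing exactly
// the database back-ends it supports, so anything else is rejected when
// the cluster definition is deserialized.
#[allow(dead_code)]
#[derive(Debug)]
enum AirflowMetadataDb {
    Postgresql { host: String, database: String, port: u16 },
    Mysql { host: String, database: String, port: u16 },
    // No Oracle or SQLite variants: a manifest naming them fails validation.
}

// Hypothetical helper: the default port applied when the manifest omits one.
fn default_port(kind: &str) -> Option<u16> {
    match kind {
        "postgresql" => Some(5432),
        "mysql" => Some(3306),
        _ => None, // not supported by this product operator
    }
}

fn main() {
    assert_eq!(default_port("postgresql"), Some(5432));
    assert_eq!(default_port("oracle"), None); // unsupported back-end
    println!("ok");
}
```

The enum (rather than a free-form string) is what makes the early, type-safe validation possible: an unsupported back-end simply has no variant to deserialize into.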

==== Database specific manifests

@@ -198,189 +198,45 @@ Support for the following database systems is planned. Additional systems may be

1.) PostgreSQL

[source,yaml]
----
postgresql:
  host: my-airflow.default.svc.cluster.local # mandatory
  database: my_database # mandatory
  port: 5432 # optional, default is 5432
  credentialsSecretName: airflow-postgresql-credentials # mandatory
  parameters: # optional
    createDatabaseIfNotExist: true
    foo: bar
----

PostgreSQL supports multiple authentication mechanisms as described https://www.postgresql.org/docs/9.1/auth-pg-hba-conf.html[here].
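As an illustration of how an operator might consume such a manifest, here is a hedged sketch: all struct and function names are hypothetical, and the Secret lookup is simulated with a map instead of the Kubernetes API.

```rust
use std::collections::HashMap;

// Hypothetical mirror of the PostgreSQL manifest fields above.
struct PostgresqlConnection {
    host: String,
    database: String,
    port: u16,
    credentials_secret_name: String,
}

// Resolve `username`/`password` from the referenced Secret and build the URI.
fn connection_uri(
    conn: &PostgresqlConnection,
    secrets: &HashMap<String, (String, String)>, // name -> (username, password)
) -> Option<String> {
    let (user, pass) = secrets.get(&conn.credentials_secret_name)?;
    Some(format!(
        "postgresql://{user}:{pass}@{}:{}/{}",
        conn.host, conn.port, conn.database
    ))
}

fn main() {
    let mut secrets = HashMap::new();
    secrets.insert(
        "airflow-postgresql-credentials".to_string(),
        ("airflow".to_string(), "s3cr3t".to_string()),
    );
    let conn = PostgresqlConnection {
        host: "my-airflow.default.svc.cluster.local".to_string(),
        database: "my_database".to_string(),
        port: 5432,
        credentials_secret_name: "airflow-postgresql-credentials".to_string(),
    };
    println!("{}", connection_uri(&conn, &secrets).unwrap());
}
```

Returning `None` when the Secret is missing stands in for the validation error a real operator would surface during reconciliation.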

2.) MySQL

[source,yaml]
----
mysql:
  host: my-airflow.default.svc.cluster.local # mandatory
  database: my_database # mandatory
  port: 3306 # optional, default is 3306
  credentialsSecretName: airflow-mysql-credentials # mandatory
  parameters: # optional
    createDatabaseIfNotExist: true
    foo: bar
----

MySQL supports multiple authentication mechanisms as described https://dev.mysql.com/doc/refman/8.0/en/socket-pluggable-authentication.html[here].

3.) Redis

We need Redis e.g. for Celery brokers or result databases.

[source,yaml]
----
redis:
  host: my-redis # mandatory
  port: 6379 # optional, default is 6379
  databaseId: 13 # optional, defaults to 0
  credentialsSecretName: redis-credentials # mandatory
----
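To make the Redis use-case concrete, the manifest fields map onto the usual Celery-style broker URL. A hedged sketch follows; the function and parameter names are illustrative, and the password would really be read from the Secret named by `credentialsSecretName`.

```rust
// Hedged sketch: deriving a Celery-style broker URL from the redis manifest.
// Convention: redis://:password@host:port/db_number
fn redis_broker_url(host: &str, port: u16, database_id: u32, password: &str) -> String {
    format!("redis://:{password}@{host}:{port}/{database_id}")
}

fn main() {
    println!("{}", redis_broker_url("my-redis", 6379, 13, "s3cr3t"));
}
```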

4.) Derby

Derby is often used as an embedded database for testing and prototyping ideas and implementations. It's not recommended for production use-cases.

[source,yaml]
----
derby:
  location: /tmp/my-database/ # optional, defaults to /tmp/derby/{unique_database_name}/derby.db
----
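Since Derby is embedded, the manifest's `location` is all that is needed to form a JDBC URL. A hedged sketch, with a hypothetical function name:

```rust
// Hedged sketch: an embedded Derby JDBC URL built from the manifest's
// `location`. `;create=true` asks Derby to create the database if missing.
fn derby_jdbc_url(location: &str) -> String {
    format!("jdbc:derby:{};create=true", location.trim_end_matches('/'))
}

fn main() {
    println!("{}", derby_jdbc_url("/tmp/my-database/"));
}
```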