Logging Backend Architecture

Summary

Opni Logging uses Opensearch as the data store backend for logging data. The system manages this using the Opensearch Kubernetes Operator. The operator controllers are imported into the Opni Manager and run as part of the manager command.
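
As an illustrative sketch (not the actual Opni code), importing operator controllers into a single controller-runtime manager looks roughly like the following; the reconciler type and the watched resource are placeholders:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// exampleReconciler stands in for the imported Opensearch operator controllers; the real
// controller types come from the operator module and are registered in the same way.
type exampleReconciler struct {
	client.Client
}

func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// A real reconciler reads the resource named in req and drives the cluster toward its spec.
	return ctrl.Result{}, nil
}

func main() {
	// One controller-runtime manager hosts Opni's reconcilers and the imported operator
	// controllers, so everything runs as part of a single manager command.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	err = ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}). // placeholder; the real controllers watch the Opensearch and Opni custom resources
		Complete(&exampleReconciler{Client: mgr.GetClient()})
	if err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```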

Architecture

High Level Architecture

Reconcilers/Custom resources

OpniOpensearch

This wraps the Opensearch Operator custom resource

Responsibilities:

  1. Generating public key infrastructure (PKI) using cert-manager. The reconciler will create certificate authorities, serving certs, and client auth certs for other reconcilers to use with the Opensearch API (see the sketch after this list)
  2. Creating the OpenSearchCluster custom resource that the Opensearch Operator reconciles to create the cluster
  3. Creating additional custom resources that Opni uses to manage operations against the Opensearch API.
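
A minimal sketch of the PKI piece of responsibility 1, assuming cert-manager's Go API and a pre-existing self-signed Issuer; every resource name here is hypothetical:

```go
package certs

import (
	"context"

	certmanagerv1 "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
	cmmeta "github.com/cert-manager/cert-manager/pkg/apis/meta/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createCA asks cert-manager for a self-signed certificate authority that serving and
// client auth certificates can later chain from.
func createCA(ctx context.Context, c client.Client, namespace string) error {
	ca := &certmanagerv1.Certificate{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "opni-opensearch-ca",
			Namespace: namespace,
		},
		Spec: certmanagerv1.CertificateSpec{
			IsCA:       true,
			CommonName: "opni-opensearch-ca",
			SecretName: "opni-opensearch-ca-keys", // the CA key pair lands in this Secret
			IssuerRef: cmmeta.ObjectReference{
				Name: "opni-selfsigned-issuer", // assumed pre-existing self-signed Issuer
				Kind: "Issuer",
			},
		},
	}
	return c.Create(ctx, ca)
}
```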

MultiClusterRoleBinding

Responsible for configuring Opensearch for Opni Logging

Responsibilities

  1. Generates a client cert for the internal admin user. This is used by the management API for querying Opensearch
  2. Binds the internal admin user to the Opensearch all_access role
  3. Creates the role that downstream clusters use to write logs to Opensearch
  4. Creates the required ISM templates with the desired retention period (see the retention policy sketch after this list)
  5. Creates the logging ingest pipeline
  6. Creates required index templates
  7. If indices already exist, makes sure the ingest pipeline is the default for those indices
  8. Bootstraps indices. If the index template needs to be updated the process will also reindex logs to the new template.
  9. Sets up index patterns in Opensearch Dashboards if it's enabled
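
A rough sketch of the retention side of item 4: an ISM policy that transitions log indices to a delete state once they reach the configured retention age. The address, policy name, and index pattern are assumptions, and the real reconciler authenticates with the admin client certificate rather than the bare HTTP client shown here:

```go
package ism

import (
	"fmt"
	"net/http"
	"strings"
)

// createRetentionPolicy creates an ISM policy that deletes log indices once they reach the
// given retention age (e.g. "7d").
func createRetentionPolicy(retention string) error {
	policy := fmt.Sprintf(`{
  "policy": {
    "description": "Opni log retention",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [{"state_name": "delete", "conditions": {"min_index_age": %q}}]
      },
      {"name": "delete", "actions": [{"delete": {}}], "transitions": []}
    ],
    "ism_template": {"index_patterns": ["logs*"], "priority": 100}
  }
}`, retention)

	req, err := http.NewRequest(http.MethodPut,
		"https://opensearch.example.svc:9200/_plugins/_ism/policies/opni-log-retention",
		strings.NewReader(policy))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("creating ISM policy: %s", resp.Status)
	}
	return nil
}
```

Under these assumptions, calling createRetentionPolicy("7d") would delete log indices one week after creation.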

LoggingCluster

Responsible for adding configuration for downstream clusters to Opensearch

Responsibilities

  1. Creates a role that grants read access to logs for the specific logging cluster (see the sketch after this list)
  2. Creates a random unique user for the downstream cluster to use to write logs
  3. Binds the user to the log indexing role
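
A small sketch of the role body from item 1, using document-level security to scope reads to one downstream cluster. The "cluster_id" field name and "logs*" index pattern are assumptions about how Opni tags log documents; the body would be sent with PUT _plugins/_security/api/roles/&lt;roleName&gt;:

```go
package security

import (
	"encoding/json"
	"fmt"
)

// clusterReadRole builds a read-only role restricted to one downstream cluster's documents.
func clusterReadRole(clusterID string) ([]byte, error) {
	// Document-level security limits reads to documents belonging to this cluster.
	dls := fmt.Sprintf(`{"term": {"cluster_id": %q}}`, clusterID)
	role := map[string]any{
		"index_permissions": []map[string]any{{
			"index_patterns":  []string{"logs*"},
			"dls":             dls,
			"allowed_actions": []string{"read"},
		}},
	}
	return json.Marshal(role)
}
```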

Test plan

Testing for reconcilers should be covered by unit testing. This uses the Kubernetes SIG test environment (envtest) to confirm that all expected Kubernetes resources are created. The Opensearch reconcilers operate against a mock to confirm the expected API endpoints are being used.
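
For reference, a minimal envtest skeleton of the kind these unit tests build on; the CRD directory path is illustrative:

```go
package controllers_test

import (
	"path/filepath"
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

func TestReconcilerCreatesResources(t *testing.T) {
	// envtest starts a local kube-apiserver and etcd so reconcilers can be exercised
	// without a full cluster.
	testEnv := &envtest.Environment{
		CRDDirectoryPaths: []string{filepath.Join("..", "config", "crd", "bases")},
	}
	cfg, err := testEnv.Start()
	if err != nil {
		t.Fatal(err)
	}
	defer testEnv.Stop()

	k8sClient, err := client.New(cfg, client.Options{})
	if err != nil {
		t.Fatal(err)
	}
	_ = k8sClient // the real tests run the reconcilers and assert the expected objects exist
}
```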

Manual E2E testing is required for changes to Opensearch reconciliation. Expected behaviour should be confirmed using a deployed Opensearch.

Management API

The Management API is the interface for interacting with the OpniOpensearch custom resource in an opinionated manner.

Diagram

Responsibilities

The management API's major transformation is converting the create/update request details into an array of Opensearch Operator NodePool details. By default it will create a single nodepool that has the controlplane, data, and ingest roles. The translation ensures there will always be either 3 or 5 nodes with the controlplane role. If there are fewer than 3 replicas, or fewer than 5 with an even count, it will create a separate quorum nodepool to make up the difference. If there are more than 5, it will move the extra replicas to an additional nodepool that doesn't contain the controlplane role.

If the separate controlplane option is selected, the main nodepool will no longer contain the controlplane role; a separate nodepool of 3 controlplane nodes will be created.

If the separate ingest option is selected, the main nodepool will no longer contain the ingest role; a separate nodepool of ingest nodes will be created with the specified number of replicas.
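
A sketch of the default translation described above (ignoring the separate controlplane and ingest options); the pool names are illustrative and the struct mirrors only the relevant NodePool fields:

```go
package management

// nodePool mirrors the relevant fields of the Opensearch Operator NodePool type.
type nodePool struct {
	Component string
	Replicas  int32
	Roles     []string
}

// defaultNodePools converts a requested replica count into nodepools such that exactly
// 3 or 5 replicas carry the controlplane role.
func defaultNodePools(replicas int32) []nodePool {
	allRoles := []string{"controlplane", "data", "ingest"}

	switch {
	case replicas < 3:
		// Too few nodes for quorum: add a controlplane-only pool to reach 3.
		return []nodePool{
			{Component: "main", Replicas: replicas, Roles: allRoles},
			{Component: "quorum", Replicas: 3 - replicas, Roles: []string{"controlplane"}},
		}
	case replicas < 5 && replicas%2 == 0:
		// An even count below 5 (i.e. 4): pad with a quorum pool so 5 nodes hold controlplane.
		return []nodePool{
			{Component: "main", Replicas: replicas, Roles: allRoles},
			{Component: "quorum", Replicas: 5 - replicas, Roles: []string{"controlplane"}},
		}
	case replicas > 5:
		// Cap controlplane at 5; the remainder goes to a pool without the controlplane role.
		return []nodePool{
			{Component: "main", Replicas: 5, Roles: allRoles},
			{Component: "data", Replicas: replicas - 5, Roles: []string{"data", "ingest"}},
		}
	default:
		// Exactly 3 or 5 replicas: a single pool is enough.
		return []nodePool{{Component: "main", Replicas: replicas, Roles: allRoles}}
	}
}
```

For example, a request for 7 replicas yields a 5-node main pool with all roles plus a 2-node pool without the controlplane role.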

Test Plan

Unit Testing - Interactions with Kubernetes are covered by unit testing with the Kubernetes SIG test environment (envtest).

E2E Testing - Basic manual smoke tests should be completed.

Logging cluster driver

The Logging cluster driver is the object that manages the logging-specific implementation for interacting with the Kubernetes objects required for logging.

Functions

The cluster driver implements the following functions:

CreateCredentials

This generates a random password for the downstream cluster to authenticate to Opensearch with. The password is stored in a Secret and a corresponding LoggingCluster object is created.

This is called by the implementation of the Install CapabilityBackend plugin function.
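
A simplified sketch of that flow, assuming a controller-runtime client; the Secret name, key, and password length are illustrative, and the LoggingCluster creation is elided:

```go
package driver

import (
	"context"
	"crypto/rand"
	"encoding/base64"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createCredentials generates a random password for a downstream cluster and stores it in a Secret.
func createCredentials(ctx context.Context, c client.Client, namespace, clusterID string) error {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return err
	}
	password := base64.RawURLEncoding.EncodeToString(raw)

	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "opni-logging-" + clusterID,
			Namespace: namespace,
		},
		StringData: map[string]string{"password": password},
	}
	if err := c.Create(ctx, secret); err != nil {
		return err
	}

	// A LoggingCluster custom resource referencing this Secret would be created here.
	return nil
}
```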

GetCredentials

Fetches the username and password from the Secret for a cluster ID.

This is called by the implementation of the Sync CapabilityBackend plugin function.

GetExternalURL

Fetches the external URL set when the logging cluster is created. This is stored in the OpniOpensearch object.

This is called by the implementation of the Sync CapabilityBackend plugin function.

GetInstallStatus

Checks whether the OpniOpensearch object exists. This is a method of checking whether the backend has been created.

This is called by the implementation of both the Sync and CanInstall CapabilityBackend plugin functions.

SetClusterStatus

Stores the last sync time and enabled boolean for a downstream cluster. This metadata is stored in the LoggingCluster object.

This is called by the implementation of the Sync CapabilityBackend plugin function.

GetClusterStatus

Fetches the last sync time and enabled metadata from the LoggingCluster object.

This is called by the implementation of the Sync CapabilityBackend plugin function.

SetSyncTime

Sets the in-memory sync time to the current timestamp.

This is called by the implementation of the Sync CapabilityBackend plugin function.

Opensearch Data Manager

The Opensearch Data Manager is the sub component responsible for interacting with the Opensearch API.

Diagram

Responsibilities

The manager contains an asynchronous Opensearch client. It will check Kubernetes for the existence of an Opensearch cluster and if it exists will configure the client for communicating with it. If the Opensearch cluster is deleted via the API it will also unset the client configuration.

It also maintains the current state of operations in a NATS KV store. This is for persistence in the event of the Gateway restarting.
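
A minimal sketch of the persistence side, assuming the nats.go JetStream KV API; the bucket name and key layout are illustrative:

```go
package datamanager

import "github.com/nats-io/nats.go"

// storeOperationState persists per-cluster operation state in a NATS KV bucket so it
// survives Gateway restarts.
func storeOperationState(nc *nats.Conn, clusterID string, state []byte) error {
	js, err := nc.JetStream()
	if err != nil {
		return err
	}
	// Create (or look up) the bucket used for operation state.
	kv, err := js.CreateKeyValue(&nats.KeyValueConfig{Bucket: "opni-logging-state"})
	if err != nil {
		return err
	}
	_, err = kv.Put(clusterID, state)
	return err
}
```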

Test Plan

Unit Testing - Opensearch API mocks and an embedded NATS server.

E2E Testing - Correct functioning should be manually reviewed. A cluster should be created and deleted using the API, and the Gateway should also be restarted.

Data delete implementation

The Logging Gateway plugin implements the delete task interface which allows logging data for a specific cluster to be deleted. The deletion process is managed by the Opensearch Data Manager.

This is part of the implementation of the TaskRunner interface

Diagram

DoClusterDataDelete

DeleteTaskStatus
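
A rough sketch of what DoClusterDataDelete amounts to, assuming log documents carry a cluster_id field and live under a logs* index pattern (both assumptions), using the delete-by-query API:

```go
package tasks

import (
	"fmt"
	"net/http"
	"strings"
)

// deleteClusterData removes every log document belonging to one downstream cluster.
// The index pattern, field name, and endpoint address are illustrative.
func deleteClusterData(clusterID string) error {
	body := fmt.Sprintf(`{"query": {"term": {"cluster_id": %q}}}`, clusterID)
	resp, err := http.Post(
		"https://opensearch.example.svc:9200/logs*/_delete_by_query?wait_for_completion=false",
		"application/json",
		strings.NewReader(body),
	)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("delete by query: %s", resp.Status)
	}
	// With wait_for_completion=false, Opensearch returns a task ID that DeleteTaskStatus
	// can poll via the tasks API to report deletion progress.
	return nil
}
```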

Scale and performance:

The Opni controllers run with a leader election system, so scaling these has no impact on performance. All controllers are created using kube-controller. Some delay is expected when first creating an Opensearch cluster, as generating passwords is a time-consuming process.

Search performance in Opensearch is dependent on the size and scale of nodes with the data role (the main nodes). These will largely be memory constrained.

Indexing performance is dependent on the size and scale of both ingest and data nodes.

Opensearch controlplane nodes require minimal resources, so they have explicitly set replica counts and resource requests.

TBD - Details on performance testing.

Security:

Interactions with the Opensearch API are a mix of basic auth and TLS client certificate authentication. Reconciler operations against the Opensearch API are authenticated with a custom admin TLS cert generated when the cluster is initialized. The management API uses a user client cert with the required permissions.

Downstream clusters index using basic auth, with an account that has index-only permissions. Each cluster will have a separate user account.

High availability:

Opensearch controlplane nodes are always configured to be highly available. The other node roles can be scaled as appropriate; however, the defaults are highly available.

HA is less of a concern for the Kubernetes controllers as they are eventually consistent; however, they can be scaled out if high availability is desired.
