K8S has five main responsibilities:
- Instantiate, initialize and start the processing for the PE
- Maintain a liveness probe which indicates to Kubernetes the health status of the PE
- Collect and report PE metrics
- Handle subscription changes for any `Export` operators inside of its PE
- Handle consistency changes for any consistent region inside of its PE
Liveness, metrics, subscriptions and consistent regions are handled in separate threads that are created before PE processing starts.
This is the most basic task of the K8S component. It goes as follows:
- Load the AADL for the PE
- Instantiate the `PEImpl` object
- Call `pe_->process()`
- Wait for completion
K8S also catches standard exceptions that could arise within `PEImpl::process`.
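A minimal sketch of this flow is below; `PEModel`, `loadAADL()`, and the simplified `PEImpl` surface are illustrative stand-ins for the actual Streams runtime types:

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Sketch only: PEModel, loadAADL(), and this PEImpl shape are simplified
// stand-ins for the actual Streams SPL runtime types.
struct PEModel {};
PEModel loadAADL(const std::string& aadlPath);

struct PEImpl {
  explicit PEImpl(const PEModel& model);
  void process();  // blocks until PE processing completes
};

void runPE(const std::string& aadlPath) {
  try {
    PEModel model = loadAADL(aadlPath);  // 1. Load the AADL for the PE
    PEImpl pe(model);                    // 2. Instantiate the PEImpl object
    pe.process();                        // 3. Run and wait for completion
  } catch (const std::exception& e) {
    // Standard exceptions escaping PEImpl::process are caught here rather
    // than allowed to take the pod down without a log message.
    std::cerr << "PE failed: " << e.what() << std::endl;
  }
}
```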
The liveness probe communicates health status by maintaining a file, `/tmp/healthy`. When that file is present, the PE is healthy. When the PE is not healthy, it removes the file, and its absence indicates to Kubernetes that the pod is no longer alive and should be terminated.
The liveness thread asynchronously receives notifications of PE static connection and disconnection events. The main thread continuously monitors overall state changes, considering the PE healthy when all static connections are connected and unhealthy when any static connection is lost.
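A sketch of the file-based probe is below, assuming an atomic flag fed by the connection notifications (all names are illustrative):

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Illustrative sketch: the flag is toggled by the static connection
// notification callbacks described above.
std::atomic<bool> allStaticConnectionsUp{true};

void livenessLoop() {
  const char* healthFile = "/tmp/healthy";
  for (;;) {
    if (allStaticConnectionsUp.load()) {
      // Presence of the file signals a healthy PE to the kubelet.
      if (std::FILE* f = std::fopen(healthFile, "w")) std::fclose(f);
    } else {
      // Removing the file makes the liveness probe fail, so Kubernetes
      // will consider the pod dead and terminate it.
      std::remove(healthFile);
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}
```

On the Kubernetes side, the pod spec would pair this with an exec liveness probe along the lines of `cat /tmp/healthy`, which fails as soon as the file disappears.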
We are still in the process of determining the best design for liveness. The difficulty with the current approach is that connection state is tied to the PE being restarted; we want to separate PE connection status from liveness. The fundamental issue we need to resolve is that in the Kubernetes way of thinking, the way to recover from an error condition is to restart the component. This approach assumes a stateless view of computation. Statefulness is assumed in the Streams approach, where components are only restarted if they have failed catastrophically or the user has explicitly requested it. The underlying assumption in the Streams world is that restarts cause data loss, which we go to great lengths to avoid.
Because of this difference in fundamental approaches, we have to be careful in how Knative Streams takes advantage of Kubernetes' services. Both Kubernetes and Streams define states for their execution atoms. Both definitions are given below, with an attempt at a reconciliation.
The source of the information below is located here.
Name | Description |
---|---|
Pending | The Pod has been accepted by the Kubernetes system, but one or more of the Container images has not been created. This includes time before being scheduled as well as time spent downloading images over the network, which could take a while. |
Running | The Pod has been bound to a node, and all of the Containers have been created. At least one Container is still running, or is in the process of starting or restarting. |
Succeeded | All Containers in the Pod have terminated in success, and will not be restarted. |
Failed | All Containers in the Pod have terminated, and at least one Container has terminated in failure. That is, the Container either exited with non-zero status or was terminated by the system. |
Unknown | For some reason the state of the Pod could not be obtained, typically due to an error in communicating with the host of the Pod. |
The source of the information below is located here.
Name | Description |
---|---|
Constructed | The PE is initializing. |
Restarting | The PE is restarting. |
Running | The PE is running. |
Starting | The PE is starting. |
Stopped | The PE is stopped. |
Stopping | The PE is stopping. |
Unknown | The domain controller service cannot be contacted to determine the current state of the PE. |
The actual implementation of the state machine is here.
Kubernetes | Streams |
---|---|
Pending | Constructed |
Running | Running |
Succeeded | Stopped |
Failed | ? |
Unknown | Unknown |
Completed | Stopped |
CrashLoopBackOff | ? |
? | Stopping |
? | Starting |
Processing elements also carry the notion of health, as defined below (from the documentation):
Summary | Description |
---|---|
Healthy | Indicates that the PE is running and all of the required and optional connections are connected. |
PartiallyHealthy | Indicates that the PE is running and all of the required connections are connected, but some of the optional connections are in the process of being connected. |
PartiallyUnhealthy | Indicates that the PE is not stopped or in the process of stopping, but either the state is not running or some required connections are not yet connected. |
Unhealthy | Indicates that the PE is stopped or is in the process of stopping. |
These health states conflate notions of connection status and PE running state. We would like to replace them with states that indicate only connection status.
We define the notion of connectivity to describe solely the status of a PE's connections. Connectivity has three states:
Summary | Description |
---|---|
Full | All connections are established. |
Partial | At least one connection is not established. |
None | No connection is established. |
The connectivity of a PE is monitored by the `K8SConnectivityThread`. It communicates the PE's connectivity to the controller by means of REST `PUT` operations on the `/state/job/:name:/pe/:id:/connectivity` API endpoint.
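For illustration, a single connectivity update via libcurl might look like the following sketch; the endpoint shape comes from the text above, while the controller host name and payload format are assumptions:

```cpp
#include <string>

#include <curl/curl.h>

// Report the PE's connectivity (e.g. "full", "partial", or "none") to the
// controller. The host name and plain-text payload are illustrative only.
bool putConnectivity(const std::string& job, int peId, const std::string& state) {
  CURL* curl = curl_easy_init();
  if (!curl) return false;
  std::string url = "http://controller/state/job/" + job +
                    "/pe/" + std::to_string(peId) + "/connectivity";
  curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
  curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, "PUT");
  curl_easy_setopt(curl, CURLOPT_POSTFIELDS, state.c_str());
  CURLcode rc = curl_easy_perform(curl);
  curl_easy_cleanup(curl);
  return rc == CURLE_OK;
}
```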
- Use as many native Kubernetes facilities as possible.
- Avoid restarting PEs unless they have crashed catastrophically or have been explicitly stopped or shut down from user code or an outside request.
- Keep the concept of PE and pod running status independent of PE-to-PE connection status.
Metrics management is handled by Prometheus for scraping and Grafana for graphing. The prometheus-cpp library is used in the runtime to expose a Prometheus-compatible service in the pods.
Prometheus is the metrics scraper. Its role is mostly to collect metrics, but it also offers a crude web interface:
Grafana is the grapher. It interfaces with Prometheus and can be configured in many ways to display the collected metrics:
The Kubernetes PE starts a metrics collector thread that periodically calls the `BasePEImpl::getMetrics()` API to collect the PE's metrics. It then converts those metrics into a format that can be consumed by Prometheus, using the `prometheus-cpp` library.
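A sketch of that collector loop is below; the snapshot type and both helpers are hypothetical stand-ins for the real `getMetrics()` conversion code:

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-ins: the real runtime hands back a metrics snapshot
// from BasePEImpl::getMetrics(), which is then mirrored into prometheus-cpp.
struct MetricSample { std::string name; double value; };
std::vector<MetricSample> collectPEMetrics();
void exportToPrometheus(const std::vector<MetricSample>& samples);

void metricsCollectorLoop() {
  for (;;) {
    // Periodically pull the PE's metrics and republish them in a
    // Prometheus-consumable form.
    exportToPrometheus(collectPEMetrics());
    std::this_thread::sleep_for(std::chrono::seconds(5));
  }
}
```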
K8S metrics reporting makes heavy use of the Prometheus concept of labels. Each metric has a name, but we also add appropriate labels. For example, the `pe_input_tuples_processed` metric stored in Prometheus maps to `nTuplesProcessed` for PE input ports, as described in the Streams documentation.
K8S registers a new `pe_input_tuples_processed` metric for each PE input port in the application. Each of these metrics also has two additional labels which identify it:
- `pe`: the PE ID that contains the input port
- `pe_port`: the port ID of that input port in the PE.
As a further example, the metric `op_input_tuples_submitted` stored in Prometheus maps to `nTuplesSubmitted` for operator output ports. Each of these metrics has five labels which identify it:
- `pe`: the PE ID that contains the operator
- `op`: the operator index within the PE
- `name`: the name of the operator
- `logical_name`: the logical name of the operator
- `op_port`: the index of the output port within the operator
Metrics with the same name are easily graphed by Grafana, and they are distinguished by their labels.
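The sketch below shows how such a labeled metric family can be registered with `prometheus-cpp`; the metric and label names follow the text above, while the bind address, label values, and sample value are placeholders:

```cpp
#include <memory>

#include <prometheus/exposer.h>
#include <prometheus/gauge.h>
#include <prometheus/registry.h>

int main() {
  // Expose an HTTP /metrics endpoint for Prometheus to scrape.
  prometheus::Exposer exposer{"0.0.0.0:9090"};
  auto registry = std::make_shared<prometheus::Registry>();
  exposer.RegisterCollectable(registry);

  // One family per metric name; one gauge per unique label set.
  auto& family = prometheus::BuildGauge()
                     .Name("pe_input_tuples_processed")
                     .Help("Tuples processed on a PE input port")
                     .Register(*registry);

  // Placeholder update: the real values come from BasePEImpl::getMetrics()
  // in the collector thread.
  auto& port0 = family.Add({{"pe", "0"}, {"pe_port", "0"}});
  port0.Set(42);
}
```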
Operator custom metrics create an edge case, as user code can create them at any time during runtime. At first glance, the Prometheus model works well: it easily allows multiple operators to create metrics with the same name, which are then distinguished by their labels. The problem is that custom metrics created by different operators do not necessarily represent the same measured quantity; they are not even required to be the same type. We could encounter a scenario in which two different operators create custom metrics with the same name but different types, and what happens in that case is currently unknown.
The Prometheus way of storing metrics does cause some awkwardness in K8S. If we imagine the relationship of PEs, PE ports, operators, and operator ports as a tree, with the appropriate metrics for each entity as leaf nodes, then K8S must maintain just those leaf nodes. This requirement makes the data structures storing each kind of metric nested in an unintuitive manner, as are the algorithms iterating over those structures.
Prometheus knows which pods to scrape through a `ServiceMonitor` custom resource definition. That CRD uses a label selector to that end. The controller has been modified to add the specific labels that the `ServiceMonitor` uses to collect the PE pods.
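As an illustration, such a `ServiceMonitor` could look like the following; the resource name, label values, and port name are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: streams-pe
spec:
  selector:
    matchLabels:
      app: streams-pe      # label the controller adds to PE pods (assumed name)
  endpoints:
    - port: metrics        # named port serving the prometheus-cpp endpoint
      interval: 10s
```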
By default, Grafana starts up empty. We configure it using `ConfigMap` resources that are mounted in Grafana's provisioning directories. With this method, both data source and dashboard definitions can be scripted and automatically generated depending on the user's situation and application.
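For example, a Prometheus data source could be provisioned with a `ConfigMap` along these lines (the resource name and service URL are assumptions), mounted under Grafana's `provisioning/datasources` directory:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090   # in-cluster Prometheus service (assumed)
```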
Grafana also has a plugin API that can be used to provide new types of graphs. With this API, it would be possible to build a graph viewer interface that displays an annotated version of the running graph the same way the Streams console does.
The subscription thread implements the UDP notification and periodic polling for `Export` operator subscription changes mentioned in the Import/Export section. Since K8S is the broker between the PE and any controllers that need to communicate with the PE, K8S must implement this policy.
The thread loops over a `poll()` on a UDP socket. When the thread receives a packet, or when enough time has passed and the `poll()` times out, it uses `curl` to fetch the job subscription board for this specific PE. The thread then compares the newly fetched job board against the last seen job board, and calls `PEImpl::routingInfoNotificationHandler()` as appropriate to add, delete, or update any entries.
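A sketch of that loop using POSIX `poll()` is below; the socket setup, timeout value, and the fetch/diff helpers are stand-ins:

```cpp
#include <poll.h>
#include <sys/socket.h>

#include <string>

// Hypothetical helpers standing in for the curl fetch and for the
// PEImpl::routingInfoNotificationHandler() bookkeeping.
std::string fetchJobBoard();
void reconcile(const std::string& previous, const std::string& current);

void subscriptionLoop(int udpFd) {
  std::string lastBoard;
  for (;;) {
    pollfd pfd{udpFd, POLLIN, 0};
    if (poll(&pfd, 1, /* timeout ms */ 5000) < 0) break;  // real code would log and retry
    if (pfd.revents & POLLIN) {
      char buf[64];
      recv(udpFd, buf, sizeof(buf), 0);  // drain the notification packet
    }
    // Whether woken by a packet or by the timeout, re-fetch and diff the
    // job board, then apply any additions, deletions, or updates.
    std::string board = fetchJobBoard();
    reconcile(lastBoard, board);
    lastBoard = board;
  }
}
```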
The consistent region thread implements the UDP notification and periodic polling for consistent region changes mentioned in the Consistent Region section. Since K8S is the broker between the PE and any controllers that need to communicate with the PE, K8S must implement this policy.