This repository contains an end-to-end demo of the integration between Apache Flink and Prometheus, using the Flink Prometheus connector.
The demo simulates a fleet of hybrid vehicles continuously emitting several types of telemetry events.
The goal is to display real-time dashboards and set up real-time alarms, using Prometheus and Grafana.
To make the telemetry events actionable, the raw events must be pre-processed with Apache Flink before being written to Prometheus.
The demo is designed to run on Amazon Managed Service for Apache Flink, Amazon Managed Streaming for Apache Kafka (MSK) and Amazon Managed Service for Prometheus (AMP), but it can be easily adapted to any other Apache Flink, Apache Kafka and Prometheus versions.
The demo includes the following components:
- Vehicle event generator: a Flink application generating simulated connected-vehicle data and writing it to Kafka.
- Pre-processor: a Flink application reading raw vehicle events from Kafka, pre-processing with enrichment, reducing frequency via aggregation, reducing cardinality, and writing to Prometheus.
- Raw event writer: a second Flink application reading raw vehicle events from Kafka, and writing them, without modifications, to Prometheus.
Example of a raw vehicle event generated by Vehicle event generator:

```json
{
  "eventType": "IC_RPM",
  "vehicleId": "V0000012345",
  "region": "UK",
  "timestamp": 1730621749,
  "measurement": 4356
}
```
Dimension cardinality:

- `eventType`:
  - `IC_RPM`: Internal Combustion engine RPM gauge
  - `ELECTRIC_RPM`: Electric motor RPM gauge (the fictional vehicles are hybrid)
  - `WARNINGS`: count of warning lights currently on in the vehicle
- `region`: cardinality = 20
- `vehicleId`: configurable (see Configuring Vehicle event generator)
All events are written to a single Kafka topic, serialized as JSON.
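For illustration, a raw event with this shape can be modeled and serialized to JSON as follows. This is a hedged sketch: the field names match the example above, but the class itself is hypothetical and not the repository's implementation.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class VehicleEvent:
    eventType: str    # "IC_RPM", "ELECTRIC_RPM" or "WARNINGS"
    vehicleId: str
    region: str
    timestamp: int    # epoch seconds
    measurement: int

event = VehicleEvent("IC_RPM", "V0000012345", "UK", 1730621749, 4356)
payload = json.dumps(asdict(event))  # what would land on the Kafka topic
```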
The Pre-processor application simulates the following operations:

- Enrich raw events adding a `vehicleModel`, based on the `vehicleId`. In the real world this would happen by looking up a reference database, possibly pre-loading or caching the results. For simplicity, in this demo the map `vehicleId` -> `vehicleModel` is hardwired.
- Calculate derived metrics: is the vehicle in motion? The vehicle is considered in motion when either `IC_RPM` or `ELECTRIC_RPM` is not zero.
- Reduce frequency and cardinality: aggregate raw events by `vehicleModel` and `region` over 5-second windows (configurable).
- Write the aggregate metrics to Prometheus.
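The steps above can be sketched in plain Python. This is illustrative only: the model map and event layout are assumptions, and the real application implements this with the Flink DataStream API, not with dictionaries.

```python
# Hardwired vehicleId -> vehicleModel map (hypothetical entries; the demo hardwires its own)
VEHICLE_MODELS = {"V0000000001": "Nebulara", "V0000000002": "Quantumis"}

def is_in_motion(vehicle_events):
    # A vehicle is in motion when either IC_RPM or ELECTRIC_RPM is non-zero
    return any(e["measurement"] != 0
               for e in vehicle_events
               if e["eventType"] in ("IC_RPM", "ELECTRIC_RPM"))

def vehicles_in_motion(window_events):
    """Count unique vehicles in motion per (model, region), for one window's events."""
    per_vehicle = {}
    for e in window_events:
        per_vehicle.setdefault(e["vehicleId"], []).append(e)
    counts = {}
    for vehicle_id, events in per_vehicle.items():
        if is_in_motion(events):
            model = VEHICLE_MODELS.get(vehicle_id, "unknown")
            key = (model, events[0]["region"])
            counts[key] = counts.get(key, 0) + 1
    return counts

window = [
    {"eventType": "IC_RPM", "vehicleId": "V0000000001", "region": "US", "measurement": 4356},
    {"eventType": "ELECTRIC_RPM", "vehicleId": "V0000000001", "region": "US", "measurement": 0},
    {"eventType": "IC_RPM", "vehicleId": "V0000000002", "region": "US", "measurement": 0},
]
```

Only the first vehicle counts as in motion here: its combustion engine RPM is non-zero, while the second vehicle has both RPM readings at zero.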
Example of an aggregate metric written by Pre-processor to Prometheus:

```json
{
  "Labels": [
    { "name": "__name__", "value": "vehicles_in_motion" },
    { "name": "model", "value": "Nebulara" },
    { "name": "region", "value": "US" }
  ],
  "Samples": [
    { "timestamp": 1730621749, "value": 42.0 }
  ]
}
```
The two metrics written by Pre-processor to Prometheus are:

- `vehicles_in_motion`: count of unique vehicles in motion in the 5-second time window, per `model`, per `region`
- `warnings`: count of vehicle warning lights on in the 5-second time window, per `model`, per `region`
Other dimensions' cardinality:

- `region`: cardinality = 20 (from the raw events)
- `model`: cardinality = 8 (for simplicity, models are hardwired in the Pre-processor)
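A quick back-of-the-envelope check of the cardinality reduction, using the numbers above. The vehicle count on the raw side is an assumed example, not a value from the demo:

```python
# Aggregated side: 2 metrics x 8 models x 20 regions
aggregated_series = 2 * 8 * 20           # at most 320 time series

# Raw side (what Raw event writer produces): 3 event types per vehicle;
# region is a property of the vehicle, so it does not multiply further.
# Assuming e.g. 10,000 simulated vehicles:
raw_series = 3 * 10_000                  # 30,000 time series
```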
You can run the demo on AWS, using Amazon Managed Service for Apache Flink, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Managed Service for Prometheus (AMP), and Amazon Managed Grafana.
Use the provided CDK stack to set up the VPC and the MSK cluster, and to build and deploy the Managed Service for Apache Flink applications.
Alternatively, you can set up the resources manually, following the step-by-step instructions.
Additionally, you create and run 3 Amazon Managed Service for Apache Flink applications:
- Vehicle data generator
- Pre-processor
- Raw event writer
Regardless of whether you create the stack using CDK or manually, some additional steps are required to set up Grafana and the dashboard, following these instructions.
For testing and development, you can run the demo partly locally:
- Run the Flink applications, Vehicle event generator, Pre-processor, and Raw event writer directly in IntelliJ. You do not need to install Apache Flink locally.
- Run Kafka locally using the provided Docker Compose stack. This simplifies the development setup by not requiring access to MSK from your development machine.
- You can use Amazon Managed Prometheus and Amazon Managed Grafana when running the Flink applications locally, without setting up any special connectivity, as long as your machine uses AWS credentials with Remote-Write permissions to the AMP workspace.
See Setup for running the demo locally for details.
Runtime configuration of the Flink jobs.
Runtime properties for Vehicle event generator.
To configure the application for running locally modify this JSON file.
| Group ID | Key | Default | Description |
|---|---|---|---|
| `KafkaSink` | `bootstrap.servers` | N/A | Kafka cluster bootstrap servers, for unauthenticated plaintext |
| `KafkaSink` | `topic` | `vehicle-events` | Topic name |
| `DataGen` | `vehicles` | `1` | Number of simulated vehicles |
| `DataGen` | `events.per.sec` | N/A | Number of simulated vehicle events generated per second |
| `DataGen` | `prob.motion.state.change` | `0.01` | Probability of each simulated vehicle changing its motion status, every time an event for that vehicle is generated (double, between `0.0` and `1.0`) |
| `DataGen` | `prob.warning.change` | `0.001` | Probability that the number of warning lights changes in each simulated vehicle, every time an event for that vehicle is generated (double, between `0.0` and `1.0`) |
Note: the "probability" parameters are used to make the randomly generated data a bit more "realistic" but they are not really important for the demo.
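One way the probability parameters could drive the simulation, as a hedged sketch (the real generator's logic may differ):

```python
import random

def next_motion_state(in_motion, prob_change, rng):
    # On each generated event, flip the vehicle's motion state with probability prob_change
    if rng.random() < prob_change:
        return not in_motion
    return in_motion

# With prob.motion.state.change = 0.0 the state never changes;
# with 1.0 it flips on every event; the default 0.01 flips rarely.
rng = random.Random(42)
```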
Runtime properties for Pre-processor.
To configure the application for running locally modify this JSON file.
| Group ID | Key | Default | Description |
|---|---|---|---|
| `KafkaSource` | `bootstrap.servers` | N/A | Kafka cluster bootstrap servers, for unauthenticated plaintext |
| `KafkaSource` | `topic` | `vehicle-events` | Topic name |
| `KafkaSource` | `group.id` | `pre-processor` | Consumer Group ID |
| `KafkaSource` | `max.parallelism` | N/A | Max parallelism of the source operator. It can be used to limit the parallelism when the application parallelism is greater than the number of partitions in the source topic. If not specified, the source uses the application parallelism. |
| `Aggregation` | `window.size.sec` | `5` | Aggregation window, in seconds |
| `PrometheusSink` | `endpoint.url` | N/A | Prometheus Remote-Write endpoint URL |
| `PrometheusSink` | `max.request.retry` | `100` | Max number of retries for retryable write errors (e.g. throttling) before the sink discards the write request and continues |
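The `max.request.retry` semantics can be sketched as follows. This is illustrative only: the actual connector's retry and backoff behavior is more elaborate.

```python
def write_with_retry(write_fn, request, max_retries):
    """Retry retryable errors up to max_retries times, then discard and continue."""
    for _ in range(max_retries + 1):
        if write_fn(request):      # True on success, False on a retryable error
            return "written"
    return "discarded"             # give up on this request so the stream keeps flowing

# A writer that is always throttled: the request is eventually dropped
always_throttled = lambda request: False
```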
Runtime properties for Raw event writer.
To configure the application for running locally modify this JSON file.
| Group ID | Key | Default | Description |
|---|---|---|---|
| `KafkaSource` | `bootstrap.servers` | N/A | Kafka cluster bootstrap servers, for unauthenticated plaintext |
| `KafkaSource` | `topic` | `vehicle-events` | Topic name |
| `KafkaSource` | `group.id` | `pre-processor` | Consumer Group ID |
| `KafkaSource` | `max.parallelism` | N/A | Max parallelism of the source operator. It can be used to limit the parallelism when the application parallelism is greater than the number of partitions in the source topic. If not specified, the source uses the application parallelism. |
| `PrometheusSink` | `endpoint.url` | N/A | Prometheus Remote-Write endpoint URL |
| `PrometheusSink` | `max.request.retry` | `100` | Max number of retries for retryable write errors (e.g. throttling) before the sink discards the write request and continues |
Consult the Amazon Managed Service for Prometheus Quotas for the default maximum Ingestion throughput per workspace.
If the ingestion to Prometheus exceeds this limit, the Flink application is throttled. This can easily happen with the Raw event writer. We recommend creating a separate AMP workspace to test writing raw events.
Note that this limit is a soft quota. You can request a quota increase for real environments. Consult the AMP Quota documentation for details.
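A rough calculation of why Raw event writer is the application likely to be throttled. The event rate is an assumed example; check your AMP workspace quota for the actual ingestion limit:

```python
events_per_sec = 50_000                  # example DataGen events.per.sec setting

# Raw event writer: one Prometheus sample per raw event
raw_samples_per_sec = events_per_sec

# Pre-processor: at most one sample per metric/model/region combination per window
window_size_sec = 5
metrics, models, regions = 2, 8, 20
agg_samples_per_sec = metrics * models * regions / window_size_sec
```

The aggregated write rate is bounded by the label cardinality and the window size (64 samples/second here), independent of the event rate, while the raw write rate grows linearly with it.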