Skip to content

aws-samples/msk-flink-streaming-cdk

Real-Time Streaming with Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink


Stability: Developer Preview

This is an experimental example. It may not build out of the box

This example is built on Construct Libraries marked "Developer Preview" and may not be updated for latest breaking changes.

It may additionally requires infrastructure prerequisites that must be created before successful build.

If build is unsuccessful, please create an issue so that we may debug the problem


Overview

Based on the following AWS Blog: https://aws.amazon.com/blogs/big-data/build-a-real-time-streaming-application-using-apache-flink-python-api-with-amazon-kinesis-data-analytics/

Architecture Diagram

This repository provides two examples of running a Python-based Apache Flink application using Amazon Managed Service for Apache Flink with stateful processing. We use custom code to generate random telemetry data that includes sensor ID, temperature, and event time. The first use case demonstrates sending a notification when the count of high temperature readings of a sensor exceeds a defined threshold within a window (for this post, 30 seconds). The second use case calculates the average temperature of the sensors within a fixed window (30 seconds), and persists the results in Amazon Simple Storage Service (Amazon S3) partitioned by event time for efficient query processing.

The workflow includes the following steps:

  1. An Amazon CloudWatch event triggers an AWS Lambda function every minute.
  2. The Lambda function generates telemetry data and sends the data to Amazon Managed Streaming for Apache Kafka (Amazon MSK). With IAM Access Control.
  3. The data is processed by an Apache Flink Python application hosted on Amazon Managed Service for Apache Flink. With IAM Access Control.
  4. After processing, data with average temperature calculation is stored in Amazon S3 and data with anomaly results is sent to the output topic of the same MSK cluster.
  5. The Lambda function monitors the output stream, and processes and sends data to the appropriate destination—for this use case, Amazon Simple Notification Service (Amazon SNS).

Additions:

  • IAM Access Control for authentication and authorization to the MSK cluster from Lambda and Managed Flink.
  • Enabled for Mutual TLS authentication
  • Updated to Flink 1.13.6
  • Enables MSK Multi-VPC connectitvity

Prerequisites

  • Maven
  • (optional) An AWS Private Certificate Authority

To Run:

  1. Install the required dependencies:
pip install -r requirements.txt
  1. Build the jar file for the Flink SQL Connector with IAM Auth
cd JarPackaging
mvn clean package
cd ..
cp JarPackaging/target/aws-iam-sql-kafka-connector-1.jar PythonKafkaSink/lib
zip -r PythonKafkaSink.zip PythonKafkaSink/
  1. Bootstrap AWS Account for CDK:
cdk bootstrap
  1. Deploy the stack (30-40 mins):
cdk deploy

or if using TLS with a Private CA:

cdk deploy --parameters privateCaArn=arn:aws:acm-pca:<AWS_REGION>:<AWS_ACCOUNT_ID>:certificate-authority/XXXXXX-XXXX-XXXX-XXXXXX-XXXXXXXX

Authentication and authorization

IAM Access Control

Follow instructions here

Mutual TLS Authentication

Follow instructions here.

For step 4 On Amazon Linux 2, truststore can be copied as:

cp /usr/lib/jvm/java-11-amazon-corretto.x86_64/lib/security/cacerts kafka.client.truststore.jks

About

No description, website, or topics provided.

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages