Real-time synchronization from PostgreSQL to Amazon Redshift with the Kafka Connect API in distributed mode; the same approach can be applied to other database systems.
Link to my presentation.
Link to my live demo on YouTube.
The Kafka Connect API is a core component of the Apache Kafka platform, and it provides a scalable, fault-tolerant option for synchronizing data between various database systems.
In this project, a streaming data pipeline was built with the Kafka Connect API to continuously capture changes in a PostgreSQL database and replicate them into an Amazon Redshift data warehouse.
Once the pipeline is constructed, a snapshot of the source PostgreSQL database is captured, and all data in the source database streams through a Kafka source connector, a Kafka broker cluster, and a Kafka sink connector into the Amazon Redshift data warehouse.
Any subsequent changes in the source database are captured by the streaming system and replicated into the Amazon Redshift data warehouse.
Deploy the Kafka broker cluster and Kafka Connect cluster on AWS EC2 instances with Ansible Playbook
- Prepare Ansible Playbooks for Confluent Platform
- An example of hosts.yml is shown in kafka/cp-ansible/hosts.yml
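As a rough sketch, an inventory for Confluent's cp-ansible playbooks groups hosts by role; the group names below follow cp-ansible's conventions, while the hostnames and SSH settings are placeholders to adapt to your own EC2 instances:

```yaml
# Sketch of a cp-ansible inventory; hostnames and SSH settings are placeholders.
all:
  vars:
    ansible_connection: ssh
    ansible_user: ubuntu
    ansible_become: true

zookeeper:
  hosts:
    ec2-zk-1.example.com:

kafka_broker:
  hosts:
    ec2-broker-1.example.com:
    ec2-broker-2.example.com:
    ec2-broker-3.example.com:

kafka_connect:
  hosts:
    ec2-connect-1.example.com:
```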
- Install Ansible Playbooks for Confluent Platform
ansible-playbook -i hosts.yml all.yml
- Install the Redshift JDBC driver
- Download the latest JDBC 4.0 driver JAR file that comes with the AWS SDK from here.
- Save the JDBC driver JAR file to
/<plugin.path>/kafka-connect-jdbc/
Postgres Node
- Install PostgreSQL
sudo apt update
sudo apt upgrade -y
sudo apt install postgresql postgresql-contrib
sudo service postgresql start
Amazon Redshift
- Create tables in the PostgreSQL database
python3 /postgres/create_tables.py <db-name> <user> <password> <server-address> <port> <table-name> <sample-file>
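The create_tables.py scripts live in this repo and read their schema from the command-line arguments and sample file. Purely as an illustration of the kind of DDL such a script issues before loading data, here is a small sketch; the table name and columns are made up for the example, and a monotonically increasing id plus an update timestamp are the kind of columns the JDBC source connector's timestamp+incrementing mode keys on:

```python
# Illustrative only: builds the kind of CREATE TABLE statement a
# create_tables.py script might run (via psycopg2) before loading data.
# The table name and columns are placeholders, not the project's real schema.

def build_create_table_sql(table, columns):
    """Render a CREATE TABLE statement from (name, type) pairs."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols});"

ddl = build_create_table_sql(
    "orders",
    [("order_id", "INTEGER"), ("customer", "VARCHAR(64)"), ("updated_at", "TIMESTAMP")],
)
print(ddl)
# → CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer VARCHAR(64), updated_at TIMESTAMP);
```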
- Create tables on Amazon Redshift cluster
python3 /redshift/create_tables.py <db-name> <user> <password> <server-address> <port> <table-name>
- Run Kafka source connector
bash postgresql_source.sh
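A script like postgresql_source.sh typically registers a JDBC source connector against the Connect REST API. The following is a hedged sketch of the kind of configuration it might POST; the connector name, table, and tracking columns are placeholders, and the angle-bracket values match the placeholders used elsewhere in this README:

```json
{
  "name": "postgresql-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://<server-address>:<port>/<db-name>?user=<user>&password=<password>",
    "mode": "timestamp+incrementing",
    "incrementing.column.name": "order_id",
    "timestamp.column.name": "updated_at",
    "table.whitelist": "orders",
    "topic.prefix": "postgres-",
    "tasks.max": "1"
  }
}
```

Such a file can be registered with `curl -X POST -H "Content-Type: application/json" --data @postgresql_source.json http://localhost:8083/connectors`, assuming a Connect worker listening on the default port 8083.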
- Run Kafka sink connector
bash redshift_sink.sh
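Similarly, redshift_sink.sh would register a JDBC sink connector that reads the replicated topics and writes to Redshift over the JDBC driver installed earlier. A hedged sketch of such a configuration (topic name and connector name are placeholders):

```json
{
  "name": "redshift-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "connection.url": "jdbc:redshift://<server-address>:<port>/<db-name>",
    "connection.user": "<user>",
    "connection.password": "<password>",
    "topics": "postgres-orders",
    "insert.mode": "insert",
    "pk.mode": "none",
    "auto.create": "false",
    "tasks.max": "1"
  }
}
```

With `auto.create` disabled, the target tables must already exist on the Redshift cluster, which is why the Redshift create_tables.py step above runs before the sink connector.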
- Load more data into the PostgreSQL database
python3 /postgres/create_tables.py <db-name> <user> <password> <server-address> <port> <table-name> <data-file>