Real-time synchronization from PostgreSQL to Amazon Redshift with the Kafka Connect API in distributed mode; the same approach can be applied to other database systems.
Link to my presentation.
Link to my live demo on YouTube.
The Kafka Connect API is a core component of the Apache Kafka platform, and it provides a scalable, fault-tolerant option for synchronizing data between various database systems.
In this project, a streaming data pipeline was built with the Kafka Connect API to continuously capture changes in a PostgreSQL database and replicate them into an Amazon Redshift data warehouse.
Once the pipeline is constructed, a snapshot of the source PostgreSQL database is captured, and all data in the source database streams through a Kafka source connector, a Kafka broker cluster, and a Kafka sink connector into the Amazon Redshift data warehouse.
Any subsequent changes in the source database are captured by the streaming system and replicated into the Amazon Redshift data warehouse.
Deploy the Kafka broker cluster and Kafka Connect cluster on AWS EC2 instances with Ansible Playbook
- Prepare Ansible Playbooks for Confluent Platform
- An example of hosts.yml is shown in kafka/cp-ansible/hosts.yml
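As a rough sketch, an inventory for Confluent's cp-ansible playbooks groups hosts by role; the group names below follow cp-ansible's conventions, while the hostnames and SSH settings are placeholders to adapt to your own EC2 instances:

```yaml
# Sketch of a cp-ansible inventory; hostnames and SSH settings are placeholders.
all:
  vars:
    ansible_connection: ssh
    ansible_user: ubuntu
    ansible_become: true

zookeeper:
  hosts:
    ec2-zk-1.example.com:

kafka_broker:
  hosts:
    ec2-broker-1.example.com:
    ec2-broker-2.example.com:
    ec2-broker-3.example.com:

kafka_connect:
  hosts:
    ec2-connect-1.example.com:
```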
- Install Ansible Playbooks for Confluent Platform
ansible-playbook -i hosts.yml all.yml
- Install the Redshift JDBC driver
- Download the latest JDBC 4.0 driver JAR file that comes with the AWS SDK from here.
- Save the JDBC driver JAR file to
/<plugin.path>/kafka-connect-jdbc/
Postgres Node
- Install PostgreSQL
sudo apt update
sudo apt upgrade -y
sudo apt install postgresql postgresql-contrib
sudo service postgresql start
Amazon Redshift
- Create tables in the PostgreSQL database
python3 /postgres/create_tables.py <db-name> <user> <password> <server-address> <port> <table-name> <sample-file>
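The create_tables.py scripts live in this repo and read their schema from the command-line arguments and sample file. Purely as an illustration of the kind of DDL such a script issues before loading data, here is a small sketch; the table name and columns are made up for the example, and a monotonically increasing id plus an update timestamp are the kind of columns the JDBC source connector's timestamp+incrementing mode keys on:

```python
# Illustrative only: builds the kind of CREATE TABLE statement a
# create_tables.py script might run (via psycopg2) before loading data.
# The table name and columns are placeholders, not the project's real schema.

def build_create_table_sql(table, columns):
    """Render a CREATE TABLE statement from (name, type) pairs."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols});"

ddl = build_create_table_sql(
    "orders",
    [("order_id", "INTEGER"), ("customer", "VARCHAR(64)"), ("updated_at", "TIMESTAMP")],
)
print(ddl)
# → CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer VARCHAR(64), updated_at TIMESTAMP);
```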
- Create tables on Amazon Redshift cluster
python3 /redshift/create_tables.py <db-name> <user> <password> <server-address> <port> <table-name>
- Run Kafka source connector
bash postgresql_source.sh
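A script like postgresql_source.sh typically registers a JDBC source connector against the Connect REST API. The following is a hedged sketch of the kind of configuration it might POST; the connector name, table, and tracking columns are placeholders, and the angle-bracket values match the placeholders used elsewhere in this README:

```json
{
  "name": "postgresql-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://<server-address>:<port>/<db-name>?user=<user>&password=<password>",
    "mode": "timestamp+incrementing",
    "incrementing.column.name": "order_id",
    "timestamp.column.name": "updated_at",
    "table.whitelist": "orders",
    "topic.prefix": "postgres-",
    "tasks.max": "1"
  }
}
```

Such a file can be registered with `curl -X POST -H "Content-Type: application/json" --data @postgresql_source.json http://localhost:8083/connectors`, assuming a Connect worker listening on the default port 8083.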
- Run Kafka sink connector
bash redshift_sink.sh
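Similarly, redshift_sink.sh would register a JDBC sink connector that reads the replicated topics and writes to Redshift over the JDBC driver installed earlier. A hedged sketch of such a configuration (topic name and connector name are placeholders):

```json
{
  "name": "redshift-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "connection.url": "jdbc:redshift://<server-address>:<port>/<db-name>",
    "connection.user": "<user>",
    "connection.password": "<password>",
    "topics": "postgres-orders",
    "insert.mode": "insert",
    "pk.mode": "none",
    "auto.create": "false",
    "tasks.max": "1"
  }
}
```

With `auto.create` disabled, the target tables must already exist on the Redshift cluster, which is why the Redshift create_tables.py step above runs before the sink connector.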
- Load more data into the PostgreSQL database
python3 /postgres/create_tables.py <db-name> <user> <password> <server-address> <port> <table-name> <data-file>