MySQL CDC to Delta Lake

This pipeline demonstrates how to read change data capture (CDC) data from a MySQL database and replicate the changes to Delta Lake table(s) on Databricks.

For more information, see Loading Data into Databricks Delta Lake in StreamSets Data Collector documentation.

Prerequisites

Setup

  • Download the pipeline and import it into Data Collector or Control Hub
  • Configure all the pipeline parameters for your MySQL Database and Databricks connections
  • If necessary, update the MySQL binlog origin to replicate only specific tables
  • By default, the Databricks Delta Lake destination is configured to auto-create each table replicated from MySQL and to write the data to DBFS. Update the destination configuration as needed for your environment.
  • Configure the Databricks Delta Lake destination to add a key column for each Delta Lake table being replicated. This is required to ensure the MERGE command runs with the correct conditional logic for inserts, updates, and deletes.
  • Start your Databricks cluster.
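The key-column requirement above exists because the destination merges each batch of CDC records into the target table: records are matched on the key column, then inserted, updated, or deleted accordingly. A minimal Python sketch of that conditional logic, assuming a single key column named `id` (the function name, operation labels, and field names here are illustrative, not StreamSets or Databricks APIs):

```python
def apply_cdc_record(table, record, key="id"):
    """Apply one CDC record to `table`, a dict keyed by the merge key column.

    `record` is assumed to carry the operation type in "__op" and the row
    data in "row" -- a simplified stand-in for the attributes a CDC record
    actually carries.
    """
    op = record["__op"]
    row = record["row"]
    k = row[key]
    if op == "DELETE":
        table.pop(k, None)   # matched on key -> delete the existing row
    else:
        table[k] = row       # INSERT or UPDATE -> upsert on the key

# Replaying a small change stream against an empty table:
table = {}
apply_cdc_record(table, {"__op": "INSERT", "row": {"id": 1, "name": "a"}})
apply_cdc_record(table, {"__op": "UPDATE", "row": {"id": 1, "name": "b"}})
apply_cdc_record(table, {"__op": "INSERT", "row": {"id": 2, "name": "c"}})
apply_cdc_record(table, {"__op": "DELETE", "row": {"id": 2, "name": "c"}})
# table now holds only {"id": 1, "name": "b"}
```

Without a correctly configured key column, updates and deletes cannot be matched to existing rows, which is why the setup step above is mandatory for each replicated table.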

Running the Pipeline

Start the pipeline. It takes a few seconds to establish a connection to Databricks. Once the connection is established, you should see records replicated from MySQL and sent to Delta Lake. Insert, update, and delete records in MySQL to watch the changes replicate to Delta Lake.

(Screenshot: pipeline running)