This pipeline demonstrates how to read change data capture (CDC) data from a MySQL database and replicate the changes to Delta Lake tables on Databricks.
For more information, see Loading Data into Databricks Delta Lake in StreamSets Data Collector documentation.
- StreamSets Data Collector 3.15.0 or higher. You can run Data Collector on your cloud provider of choice, or download it for local use.
- Complete the prerequisites for the Databricks Delta Lake destination
- MySQL server with the binary log enabled
- MySQL Connector/J JDBC Driver
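The binary-log prerequisite above means row-based logging, since CDC needs the full row images of each change. A minimal `my.cnf` sketch (the `server-id` value is an arbitrary example; any unique, non-zero id works):

```ini
[mysqld]
server-id        = 223344     # must be unique and non-zero for replication clients
log_bin          = mysql-bin  # enable the binary log
binlog_format    = ROW        # CDC requires row-based events, not statement-based
binlog_row_image = FULL       # log complete before/after row images
```

After restarting MySQL, `SHOW VARIABLES LIKE 'log_bin';` should return `ON`.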
- Download the pipeline and import it into Data Collector or Control Hub
- Configure all the pipeline parameters for your MySQL Database and Databricks connections
- If necessary, update the MySQL Binary Log origin to replicate only specific tables
- By default, the Databricks Delta Lake destination is configured to auto-create each table replicated from MySQL and to write the data to DBFS. Update the destination configuration as needed.
- Configure the Databricks Delta Lake destination to add a key column for each Delta Lake table being replicated. This is required to ensure the MERGE command runs with the right conditional logic for inserts, updates, and deletes.
- Start your Databricks cluster.
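To limit which tables are replicated, one option is the origin's table include/ignore filters, which take comma-separated `database.table` patterns. The field names follow the MySQL Binary Log origin's configuration, and the database and table names below are illustrative, not values from the sample pipeline:

```text
Include Tables: retail.orders,retail.customers
Ignore Tables:  retail.%_staging
```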
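With a key column configured, the destination's MERGE behaves conceptually like the statement sketched by this helper. This is illustrative Python, not StreamSets code; the table name (`customers`), key column (`id`), staging table (`staged_updates`), and operation column (`op_type`) are all assumptions for the example:

```python
def build_merge_sql(table: str, key: str, columns: list[str]) -> str:
    """Sketch the Delta Lake MERGE run for one replicated table.

    Delete events are matched first so they remove the target row;
    remaining matches become updates, and non-matches become inserts.
    """
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    cols = ", ".join(columns)
    vals = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {table} t USING staged_updates s ON t.{key} = s.{key} "
        f"WHEN MATCHED AND s.op_type = 'DELETE' THEN DELETE "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"
    )

print(build_merge_sql("customers", "id", ["id", "name", "email"]))
```

Without a key column, the `ON` condition cannot be built, which is why the destination requires one per table.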
Start the pipeline. It takes a few seconds to establish a connection to Databricks. Once the connection is established, you should see records replicated from MySQL and sent to Delta Lake. Insert, update, and delete records in MySQL to see how the changes are replicated in Delta Lake.
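For a quick end-to-end check, statements like the following (the table and column names are examples, not values from the sample pipeline) should surface in the corresponding Delta Lake table as an insert, then an update, then a delete:

```sql
INSERT INTO retail.customers (id, name, email) VALUES (1, 'Ada', 'ada@example.com');
UPDATE retail.customers SET email = 'ada@lovelace.dev' WHERE id = 1;
DELETE FROM retail.customers WHERE id = 1;
```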