SQL Server CDC to Databricks Delta Lake

Important: These instructions assume you have access to StreamSets Data Collector (v3.15+) and have performed all the prerequisites for SQL Server and Databricks Delta Lake.

OVERVIEW

This pipeline demonstrates how to read change data capture (CDC) data from a SQL Server database and replicate the changes to Databricks Delta Lake. The pipeline assumes the following:

  • The SQL Server database is enabled for Change Data Capture (see the example after this list).
  • All tables in the database will be tracked for ingesting changes (inserts, updates, and deletes). If you need to track only certain tables, configure the Capture Instance Name accordingly in the origin.
  • Each source table will be mapped to its corresponding table in Delta Lake. If a table doesn't exist in Delta Lake, it will be auto-created.
  • The SQL Server CDC client generates records for multiple transaction types. The transaction type is captured in the record header as an attribute called sdc.operation.type:
    • 1 for Insert
    • 2 for Delete
    • 3 for Update (the row image captured after the update operation)
    • 5 for unsupported operations, including the row image captured before the update operation.
      The Databricks Delta Lake destination handles every operation type except sdc.operation.type=5, so the Stream Selector processor in the pipeline routes records with sdc.operation.type=5 to a discard path.
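As a reference, Change Data Capture is enabled in SQL Server with the sys.sp_cdc_enable_db and sys.sp_cdc_enable_table system procedures. The sketch below is a minimal example assuming a hypothetical dbo.customers table; adjust the schema, table, role, and capture instance names for your environment.

    -- Run in the source database to enable CDC at the database level
    EXEC sys.sp_cdc_enable_db;

    -- Enable CDC for one table; the capture instance name set here is what the
    -- Capture Instance Name configuration in the origin matches against
    -- (it defaults to <schema>_<table>, i.e. dbo_customers)
    EXEC sys.sp_cdc_enable_table
        @source_schema    = N'dbo',
        @source_name      = N'customers',
        @capture_instance = N'dbo_customers',
        @role_name        = NULL;

For the unsupported records, the Stream Selector route in the template typically uses a condition along the lines of ${record:attribute('sdc.operation.type') == 5}; check the imported pipeline for the exact expression it ships with.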

Disclaimer: This pipeline is meant to serve as a template for performing SQL Server CDC to Databricks Delta Lake. Some of the parameters, tables and fields may be different for your environment and may need additional customizations. Please consult the StreamSets documentation for full information on configuration of each stage used below.

PIPELINE

Pipeline

DOCUMENTATION

SQL Server CDC Origin

Stream Selector Processor

Databricks Delta Lake Destination

STEP-BY-STEP

Step 1: Download the pipeline

Click Here to download the pipeline and save it to your drive.

Step 2: Import the pipeline

Click the down arrow next to the "Create New Pipeline" button and select "Import Pipeline from archive".

Step 2

Click "Browse" and locate the pipeline file you just downloaded, then click "Import"

Step 2a

Step 3: Configure the parameters

Click on the pipeline you just imported to open it, select the "Parameters" tab, and fill in the appropriate information for your environment.

Important: The pipeline template uses the most common default settings for options such as the Delta Lake staging location. All of these are configurable; if you need to change them, you can opt not to use the built-in parameters and choose the appropriate settings yourself. Please refer to the documentation listed in this document for all the available options.

Step 3

The following parameters are set up for this pipeline (illustrative placeholder values are sketched after the list):

sqlserver_jdbc_url: JDBC URL used to connect to the SQL Server database.
sqlserver_username: SQL Server username.
sqlserver_password: SQL Server password.
Tip: To secure sensitive information such as user tokens and passwords, you can use runtime resources or credential stores.
databricks_jdbc_url: JDBC URL used to connect to the Databricks cluster.
databricks_token: Personal access token used to connect to the Databricks cluster.
deltalake_database: Databricks database name for all tables.
deltalake_s3_bucket: S3 bucket for staging data before invoking the COPY/MERGE command.
s3_access_key: AWS access key. Leave empty if you enable IAM roles.
s3_secret_key: AWS secret key. Leave empty if you enable IAM roles.
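As a rough illustration, the connection parameters typically take the following shape; every value below is a placeholder, not a value from the template.

    sqlserver_jdbc_url    jdbc:sqlserver://<host>:1433;databaseName=<database>
    databricks_jdbc_url   jdbc:spark://<workspace-host>:443/default;transportMode=http;ssl=1;httpPath=<http-path>
    deltalake_database    <database_name>
    deltalake_s3_bucket   <staging-bucket-name>

The Databricks URL is available on the cluster's JDBC/ODBC configuration page; depending on the driver version your workspace provides, it may start with jdbc:databricks:// instead of jdbc:spark://. For the password and token parameters, a runtime resource reference such as ${runtime:loadResource('sqlserver_password.txt', true)} can stand in for the literal value if you prefer not to store credentials in the pipeline itself.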

Step 4: Configure Key Columns in Delta Lake destination

Configure each table's key columns in the Databricks Delta Lake destination. For CDC data, the destination applies a MERGE command on Delta Lake, which requires key columns to match incoming changes against existing rows in each table. See details here.
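For context, the destination's CDC handling behaves roughly like the Delta Lake MERGE sketched below, using the key columns you configure to match staged changes against the target table. The table, column, and operation-flag names are hypothetical, and the actual statement the destination generates may differ.

    -- Conceptual merge for a target table keyed on customer_id
    MERGE INTO sales_cdc.customers AS target
    USING staged_changes AS source
      ON target.customer_id = source.customer_id
    WHEN MATCHED AND source.op_type = 'DELETE' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;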

Step 4

Step 5: Run the pipeline

Click the "START" button to run the pipeline.

Step 5

Step 6: Make changes to the SQL Server source tables and see the pipeline process them

Step 6