Important: These instructions assume you have access to StreamSets Data Collector (v3.15+) and have performed all the prerequisites for SQL Server and Databricks Delta Lake
- For help installing StreamSets Data Collector, see StreamSets Data Collector Installation.
- Your SQL Server database is enabled for Change Data Capture (CDC). For help with enabling CDC, see About CDC - SQL Server.
- For help with Databricks Delta Lake, see Delta Lake Prerequisites.
This pipeline demonstrates how to read change data capture (CDC) data from a SQL Server database and replicate the changes to Databricks Delta Lake. The pipeline assumes the following:
- SQL Server database is enabled for Change Data Capture.
- All tables in the database will be tracked for ingesting changes (Inserts, Updates and Deletes). If you need to track only certain tables, configure the Capture Instance Name accordingly in the origin.
- Each source table will be mapped to it's corresponding table in Delta Lake. If the table doesn't exist in Delta Lake, it will be auto-created.
- The SQL Server CDC client generates records for multiple transaction types. The transaction type is captured in the record header as an attribute called sdc.operation.type:
- 1 for Insert
- 2 for Delete
- 3 for Update, including updates captured after the update operation
- 5 for unsupported operations, including updates captured before the update operation.
The Databricks Delta Lake destination handles all but one operation type, which is sdc.operation.type=5. The Stream Selector processor in the pipeline routes and discards all records with sdc.operation.type=5.
Disclaimer: This pipeline is meant to serve as a template for performing SQL Server CDC to Databricks Delta Lake. Some of the parameters, tables and fields may be different for your environment and may need additional customizations. Please consult the StreamSets documentation for full information on configuration of each stage used below.
Databricks Delta Lake Destination
Click Here to download the pipeline and save it to your drive.
Click the down arrow next to the "Create New Pipeline" and select "Import Pipeline from archive".
Click "Browse" and locate the pipeline file you just downloaded, then click "Import"
Click on the pipeline you just imported to open it and click on the "Parameters" tab and fill in the appropriate information for your environment.
Important: The pipeline template uses the most common default settings for things like the Delta Lake staging location, etc. All of these are configurable and if you need to change those, you can opt to not use the built-in parameters and choose the appropriate settings yourself. Please refer to the documentation listed in this document for all the available options.
The following parameters are set up for this pipeline:
sqlserver_jdbc_url
|
JDBC URL to connect to the SQL Server database. |
sqlserver_username
|
SQL Server username. |
sqlserver_password
|
SQL Server password. Tip: To secure sensitive information such as user tokens and passwords, you can use runtime resources or credential stores. |
databricks_jdbc_url
|
JDBC URL used to connect to the Databricks cluster |
databricks_token
|
Personal access token used to connect to the Databricks cluster |
deltalake_database
|
Databricks database name for all tables |
deltalake_s3_bucket
|
S3 bucket for staging data before invoking COPY/MERGE command. |
s3_access_key
|
AWS access key. Leave empty if you enable IAM Roles |
s3_secret_key
|
AWS secret key. Leave empty if you enable IAM Roles |
Configure each table's Key Column in the Delta Lake destination. For CDC data, the MERGE command is applied on Delta Lake requiring columns for matching each table. See details here
Click the "START" button to run the pipeline.