dbt-rockset

The dbt-Rockset adapter brings real-time analytics to dbt. Using the adapter, you can load data into Rockset and create collections by writing SQL SELECT statements in dbt. These collections can then be built on top of each other to support highly complex data transformations with many dependency edges.

The following sections describe the adapter's installation procedure and its support for dbt.

Installation and Setup

The following subsections describe how to set up and use the adapter:

See the adapter's GitHub repo for additional information.

Install the Plug-in

Open a command-line window and run the following command to install the adapter:

pip3 install dbt-rockset

Configure your Profile

To connect with your Rockset account, configure a dbt profile similar to the example shown below. Enter the workspace in which you'd like your dbt collections to be created, along with a Rockset API key. The database field is required by dbt but unused by Rockset.

rockset:
  outputs:
    dev:
      type: rockset
      threads: 1
      database: N/A
      workspace: <rockset_workspace_name>
      api_key: <rockset_api_key>
      api_server: <rockset_api_server> # Optional, defaults to `api.usw2a1.rockset.com`, the API server for the Oregon (usw2a1) region.
      vi_rrn: <rockset_virtual_instance_rrn> # Optional, the Virtual Instance to use for INSERT INTO SELECT (IIS) queries
      run_async_iis: <async_iis_queries> # Optional, defaults to false; whether to use async execution for IIS queries
  target: dev

Update your dbt project to use this Rockset dbt profile. You can switch profiles in your project by editing the dbt_project.yml file.
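
For example, the relevant line in dbt_project.yml, assuming a hypothetical project named my_dbt_project:

name: 'my_dbt_project'
# Use the `rockset` profile defined in profiles.yml
profile: 'rockset'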

Supported Materializations

Type          Supported?   Details
Table         Yes          Creates a Rockset collection.
View          Yes          Creates a Rockset view.
Ephemeral     Yes          Creates a CTE.
Incremental   Yes          Creates a Rockset collection if it doesn't exist, and writes to it.

Query Lambda Configuration

Query Lambdas can be created and updated using dbt. To manage a Query Lambda with dbt, use the 'query_lambda' materialization. For example:

{{
    config(
        materialized='query_lambda',
        tags=['example_tag'],
        parameters=[
            {'name': 'order_id', 'type': 'string', 'value': 'xyz'},
            {'name': 'limit', 'type': 'int', 'value': '10'},
        ]
    )
}}

select * from {{ ref('orders') }} 
where order_id = :order_id
order by order_time
limit :limit

See the tests for more example usages.

⚠️ Query Lambdas cannot be referenced as models in other dbt models, because they cannot be executed from dbt.

Real-Time Streaming ELT Using dbt + Rockset

As data is ingested, Rockset performs the following:

  • The data is automatically indexed in at least three different ways using Rockset’s Converged Index™ technology.
  • Your write-time data transformations are performed.
  • The data is made available for queries within seconds.

When you execute queries on that data, Rockset leverages those indexes to complete any read-time data transformations you define using dbt, with sub-second latency.

Write-Time Data Transformations Using Rollups and Ingest Transformation

Rockset can extract and load semi-structured data from multiple sources in real time. High-velocity data (e.g., data streams) can be rolled up at write time. For example, when you have streaming data coming in from Kafka or Kinesis, you can create a Rockset collection for each data stream and then set up rollups to perform transformations and aggregations on the data as it is written into Rockset (a sketch follows the list below). This can help to:

  • Reduce the size of large scale data streams.
  • De-duplicate data.
  • Partition your data.
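
For illustration, a minimal sketch of a rollup query; rollups in Rockset aggregate incoming documents via the special _input alias, and the order_time, order_status, and order_total fields below are hypothetical:

-- Roll up a stream of orders into hourly aggregates at write time
select
    date_trunc('HOUR', parse_timestamp_iso8601(_input.order_time)) as hour,
    _input.order_status as order_status,
    count(*) as order_count,
    sum(_input.order_total) as total_revenue
from _input
group by date_trunc('HOUR', parse_timestamp_iso8601(_input.order_time)), _input.order_status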

Collections can also be created from other data sources including:

  • Data lakes (e.g., S3 or GCS).
  • NoSQL databases (e.g., DynamoDB or MongoDB).
  • Relational databases (e.g., PostgreSQL or MySQL).

You can then use Rockset's ingest transformation to transform the data with SQL statements as it is written into Rockset.
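
For example, a minimal sketch of an ingest transformation, which also selects from the special _input alias; the ssn and email fields are hypothetical:

-- Drop a sensitive field and normalize another as documents are written
select
    _input.* except (ssn),
    lower(_input.email) as email
from _input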

Read-Time Data Transformations Using Rockset Views

The adapter can set up data transformations as SQL statements in dbt, using view materializations that are performed at read time.

To set this up:

  1. Create a dbt model using SQL statements for each transformation you want to perform on your data (a sketch follows these steps).
  2. Execute dbt run. dbt will automatically create a Rockset view for each dbt model, which performs all of the data transformations when queries are executed.
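
A minimal sketch of such a view model, assuming a hypothetical upstream orders model:

{{ config(materialized='view') }}

-- Read-time transformation: aggregate orders per user on every query
select
    user_id,
    count(*) as order_count,
    sum(order_total) as lifetime_spend
from {{ ref('orders') }}
group by user_id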

If queries complete within your latency requirements, then you have achieved the gold standard of real-time data transformations: Real-Time Streaming ELT.

Your data will be automatically kept up to date in real time and reflected in your queries. There is no need for periodic batch updates to “refresh” your data. You will not need to execute dbt run again after the initial setup, unless you want to change the actual data transformation logic (e.g., adding or updating dbt models).

Persistent Materializations Using dbt + Rockset

If write-time transformations and views don't meet your application’s latency requirements (or your data transformations become too complex), you can persist them as Rockset collections.

Rockset requires synchronous queries to complete in under two minutes to cater to real-time use cases, which may affect you if your read-time transformations are too complicated. Persisting transformations instead implies a batch ELT workflow, in which you manually execute dbt run each time you want to update your transformed data. You can micro-batch these runs, executing dbt frequently to keep your transformed data up to date in near real time.

Persistent materializations are both faster to query and better at handling query concurrency, as they are materialized as collections in Rockset. Since the bulk of the data transformations have already been performed ahead of time, your queries will complete significantly faster because you can minimize the complexity required at read time.

There are two persistent materializations available in dbt:

Materializing dbt Incremental Models in Rockset

Incremental Models enable dbt to insert or update only the documents that have been generated since the last dbt run into a Rockset collection. This can significantly reduce build time, since Rockset only needs to transform the newly generated data rather than dropping, recreating, and transforming the entire data set.

Depending on the complexity of your data transformations, incremental materializations may not always be a viable option to meet your transformation requirements. Incremental materializations are best suited for event or time-series data streamed directly into Rockset. To tell dbt which documents to transform during an incremental run, provide SQL that filters for those documents using the is_incremental() macro in your dbt code (see the sketch below). You can learn more about configuring incremental models in the dbt documentation.
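
A minimal sketch of an incremental model, assuming a hypothetical upstream events model; _event_time is the document timestamp field that Rockset maintains on every collection:

{{ config(materialized='incremental') }}

select *
from {{ ref('events') }}

{% if is_incremental() %}
-- Only transform documents that arrived since the last dbt run
where _event_time > (select max(_event_time) from {{ this }})
{% endif %}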

Materializing dbt Table Models in Rockset

A Table Model is a transformation that drops and recreates an entire Rockset collection on each execution of dbt, updating the collection's transformed data with the most up-to-date source data. This is the simplest way to persist transformed data in Rockset, and it results in much faster queries since the transformations are completed prior to query time.
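
The configuration mirrors the incremental sketch above; only the materialization changes, and the upstream events model is again hypothetical:

{{ config(materialized='table') }}

-- Rebuild the entire collection from the upstream model on every dbt run
select *
from {{ ref('events') }}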

However, Table Models can be slow to complete, since Rockset is not optimized for creating entirely new collections from scratch on the fly. This may significantly increase your data latency, as it may take several minutes for Rockset to provision resources for a new collection and then populate it with transformed data.

Putting It All Together

You can combine Table Models, Incremental Models, and Rockset views to assemble a stack that meets the unique requirements of your data transformations. For example, you can:

  • Use SQL-based rollups to transform your streaming data at write time.
  • Transform and persist data into Rockset collections via Incremental or Table Models.
  • Execute a sequence of view models at read time to transform your data again.

Testing, Formatting, & Caveats

Testing Changes

Before landing a commit, ensure that your changes pass tests. Your credentials should be set in the .env file in the repo root. See the test.env file for an example. Once the .env file is set up, you can run the tests using the following command:

pytest -s tests/functional

Formatting

Before landing a commit, format your changes using black:

# Install
python -m pip install black
# Usage (From repo root)
python -m black .

Caveats

  1. unique_key is not supported with incremental models, unless it is set to _id, which acts as a natural unique key in Rockset anyway.
  2. The table materialization is slower in Rockset than in most other warehouses due to Rockset's architecture as a low-latency, real-time database. Creating new collections requires provisioning hot storage to index and serve fresh data, which takes about a minute.
  3. Rockset queries have a two-minute timeout unless run asynchronously. You can extend this limit to 30 minutes by setting run_async_iis to true. However, if the query is queued because you have hit your org's Concurrent Query Execution Limit (CQEL), it must begin execution within two minutes; otherwise, your IIS query will error. If the query leaves the queue and begins execution within two minutes, the normal 30-minute limit applies.