In [0]:
%pip install pytest==8.4.2

# Delta Live Tables - Stream-Static Join Example

In this notebook you will build a Delta Live Tables pipeline that:
- Reads `web_site_events` from an existing Bronze table (streaming source).
- Reads the `products` and `users` lookup tables from Silver (static sources).
- Cleans the data by dropping events with a null or unmatched `user_id` through an `INNER JOIN`.
- Enriches the events with product and user information.
- Outputs a Silver layer table.

We use:
- `dlt.read_stream` for streaming ingestion.
- `dlt.read` for batch/static lookups.
- `@dlt.table` to declare the DLT tables.


> **Key Note:**
> Throughout the notebook you will find a series of function definitions. They are only a guide, so feel free to use them, remove them, or write your own code.

## Silver: Stream-Static Join

Create the **Silver layer** by joining streaming events with static lookup tables:

- `INNER JOIN` with `users` drops any event whose `user_id` is null or has no matching user.
- `LEFT JOIN` with `products` enriches events with product information and keeps events that have no matching product.
- The result is a clean, enriched Silver table.

**Important:**  
- Streaming source: `web_site_events` in the Bronze schema  
- Static sources: `users` and `products` in the Silver schema
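
To make the join behaviour above concrete, here is a minimal plain-Python sketch (not Spark or DLT code) of an inner join on `user_id` followed by a left join on `product_id`. The sample rows and the helper name `enrich` are made up for illustration only.

```python
# Plain-Python illustration of the join semantics (not Spark code).
# All sample rows below are hypothetical.
events = [
    {"event_id": 1, "user_id": 10, "product_id": 100},
    {"event_id": 2, "user_id": None, "product_id": 100},  # null user_id
    {"event_id": 3, "user_id": 99, "product_id": 100},    # unknown user
    {"event_id": 4, "user_id": 10, "product_id": None},   # no product
]
users = {10: {"user_name": "Ada"}}
products = {100: {"product_name": "Widget"}}


def enrich(events, users, products):
    """Inner join on user_id, then left join on product_id."""
    enriched = []
    for event in events:
        user = users.get(event["user_id"])
        if user is None:
            # Inner join: events without a matching user are dropped.
            continue
        # Left join: keep the event even when no product matches.
        product = products.get(event["product_id"], {"product_name": None})
        enriched.append({**event, **user, **product})
    return enriched


result = enrich(events, users, products)
print([row["event_id"] for row in result])  # [1, 4]
```

Events 2 and 3 disappear because of the inner join, while event 4 survives with a null `product_name` because the product join is a left join.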

## Expected output schema
- event_id
- event_timestamp
- event_type
- session_id
- device_type
- endpoint
- referrer_url
- user_id
- user_name
- user_email
- user_phone
- is_active_user
- product_id
- product_name
- product_category
- product_price
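
One way to check a result against this list is to compare column names directly. The helper below is a hypothetical sketch (the name `schema_diff` is not part of DLT or of this notebook's helpers); it works on any list of column names, for example `df.columns` from a Spark DataFrame.

```python
# Expected Silver columns, copied from the list above.
EXPECTED_COLUMNS = [
    "event_id", "event_timestamp", "event_type", "session_id",
    "device_type", "endpoint", "referrer_url",
    "user_id", "user_name", "user_email", "user_phone", "is_active_user",
    "product_id", "product_name", "product_category", "product_price",
]


def schema_diff(actual_columns, expected_columns=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names, preserving order."""
    actual = list(actual_columns)
    missing = [c for c in expected_columns if c not in actual]
    unexpected = [c for c in actual if c not in expected_columns]
    return missing, unexpected


# Example: a result that forgot to rename `email` to `user_email`.
columns = [c for c in EXPECTED_COLUMNS if c != "user_email"] + ["email"]
print(schema_diff(columns))  # (['user_email'], ['email'])
```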


In [0]:
import dlt

# 🧰 Get configuration from pipeline parameters
catalog = spark.conf.get("catalog")
bronze_schema = spark.conf.get("bronze_schema")
silver_schema = spark.conf.get("silver_schema")

@dlt.table(
    name="web_site_events",
    comment="Silver layer - enriched web site events after stream-static join."
)
def web_site_events_silver():
    """
    Creates the Silver layer Delta Live Table (DLT) for enriched website events.

    This function reads streaming website event data from the Bronze layer and 
    enriches it with static user and product dimension data from the Silver layer. 
    It performs the following transformations:
      - Reads the `web_site_events` stream from the Bronze schema.
      - Reads and cleans the `products` and `users` tables from the Silver schema.
      - Joins user and product data to the events stream.
      - Selects and renames key fields for the Silver layer.

    The resulting table provides a curated, query-ready dataset combining event,
    user, and product information for downstream analytics.

    Returns:
        DataFrame: A transformed and enriched Spark DataFrame containing:
            - event_id
            - event_timestamp
            - event_type
            - session_id
            - device_type
            - endpoint
            - referrer_url
            - user_id
            - user_name
            - user_email
            - user_phone
            - is_active_user
            - product_id
            - product_name
            - product_category
            - product_price
    """

    events_stream = dlt.read_stream(f"{catalog}.{bronze_schema}.web_site_events")

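    # Static product lookup: drop the audit column and disambiguate `name`.
    products = (
        dlt.read(f"{catalog}.{silver_schema}.products")
        .drop("last_modified")
        .withColumnRenamed("name", "product_name")
    )

    # Static user lookup: same cleanup as for products.
    users = (
        dlt.read(f"{catalog}.{silver_schema}.users")
        .drop("last_modified")
        .withColumnRenamed("name", "user_name")
    )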

    return (
        events_stream
            .join(users, on="user_id", how="inner")
            .join(products, on="product_id", how="left")
            .selectExpr(
                "event_id",
                "event_timestamp",
                "event_type",
                "session_id",
                "device_type",
                "endpoint",
                "referrer_url",
                "user_id",
                "user_name",
                "email as user_email",
                "phone as user_phone",
                "is_active as is_active_user",
                "product_id",
                "product_name",
                "category as product_category",
                "price as product_price"
            )
    )


# How to Run the DLT Notebook in Databricks

To execute this Delta Live Tables (DLT) notebook:

1. Go to **Jobs & Pipelines** → **ETL Pipeline**.
2. **Disable the Lakehouse Flow editor** option.
3. Give your pipeline a **name**.
4. Select **Serverless** for the cluster type.
5. Set **Pipeline Mode** to **Continuous** for production and **Triggered** for development and test.
6. In **Source Code**, select this notebook.
7. Choose the desired **catalog** (dev or prod) and the **schema** for the Silver layer.
8. Create the three required configurations: `catalog`, `bronze_schema`, and `silver_schema`.
9. Click **Create** to finish setup.

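In **Continuous** mode the pipeline runs continuously, processing incoming streaming data and keeping the Silver table `web_site_events` up to date. In **Triggered** mode it processes the data available at the start of each run and then stops.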
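
The three configuration keys from step 8 are what the notebook combines into fully qualified table names. The following is a minimal sketch of that step, using a plain dict in place of `spark.conf`. The helper name `table_names` and the sample values `dev`, `bronze`, and `silver` are hypothetical.

```python
REQUIRED_KEYS = ("catalog", "bronze_schema", "silver_schema")


def table_names(conf):
    """Build the fully qualified source table names from pipeline settings.

    `conf` is any mapping of configuration keys to values; in the notebook
    the same values come from spark.conf.get(...).
    """
    missing = [key for key in REQUIRED_KEYS if not conf.get(key)]
    if missing:
        raise ValueError(f"Missing pipeline configuration: {', '.join(missing)}")
    catalog = conf["catalog"]
    return {
        "events": f"{catalog}.{conf['bronze_schema']}.web_site_events",
        "users": f"{catalog}.{conf['silver_schema']}.users",
        "products": f"{catalog}.{conf['silver_schema']}.products",
    }


# Hypothetical sample values for a development pipeline.
sample_conf = {"catalog": "dev", "bronze_schema": "bronze", "silver_schema": "silver"}
print(table_names(sample_conf)["events"])  # dev.bronze.web_site_events
```

Failing early with a clear message when a key is missing is usually easier to debug than a table-not-found error raised later inside the pipeline.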


# Tests for Delta Live Tables Pipeline

The following cells contain tests to validate the correctness of the Delta Live Tables pipeline defined above. These tests ensure that the pipeline behaves as expected under various scenarios.

In [0]:
from helpers import test_runner
import os

notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
os.environ["NOTEBOOK_NAME"] = notebook_path.split("/")[-1]

test_runner.run()