
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>



# Auto Load Data to Bronze

The chief architect has decided that, rather than connecting directly to Kafka, a source system will send raw records as JSON files to cloud object storage. In this notebook, you'll build a multiplex table that ingests these records with Auto Loader and stores the entire history of this incremental feed. The initial table stores data from all of our topics and has the following schema:

| Field | Type |
| --- | --- |
| key | BINARY |
| value | BINARY |
| topic | STRING |
| partition | LONG |
| offset | LONG |
| timestamp | LONG |
| date | DATE |
| week_part | STRING |

This single table will drive the majority of the data through the target architecture, feeding three interdependent data pipelines.
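
To make the multiplex idea concrete, here is a minimal plain-Python sketch (not Spark code) in which records from several topics share one collection and each downstream consumer selects only its own topic. The topic names, keys, and payloads are hypothetical and chosen purely for illustration; `demultiplex` is a helper invented for this sketch.

```python
from collections import defaultdict

# Hypothetical multiplexed records: every topic lands in the same "table".
# Topic names and payloads are illustrative assumptions, not the course data.
bronze_rows = [
    {"topic": "topic_a", "key": b"1", "value": b'{"x": 1}'},
    {"topic": "topic_b", "key": b"2", "value": b'{"y": 2}'},
    {"topic": "topic_a", "key": b"3", "value": b'{"x": 3}'},
]


def demultiplex(rows):
    """Group multiplexed rows by topic so each consumer sees only its own records."""
    by_topic = defaultdict(list)
    for row in rows:
        by_topic[row["topic"]].append(row)
    return dict(by_topic)


routed = demultiplex(bronze_rows)
print(sorted(routed))          # ['topic_a', 'topic_b']
print(len(routed["topic_a"]))  # 2
```

In the pipeline itself, the same effect would typically come from filtering the bronze table on its `topic` column rather than from grouping rows in memory.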

<!-- <img src="https://files.training.databricks.com/images/ade/ADE_arch_bronze.png" width="60%" /> -->

**NOTE**: For additional configuration options when connecting directly to Kafka, see the <a href="https://docs.databricks.com/spark/latest/structured-streaming/kafka.html" target="_blank">Kafka connector documentation</a>.


## Learning Objectives

By the end of this lesson, you should be able to:
- Describe a multiplex design
- Apply Auto Loader to incrementally process records

In [0]:
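from pyspark import pipelines as dp
import pyspark.sql.functions as F

# Values supplied through the pipeline configuration:
# `source` is the base path of the raw JSON files,
# `lookup_db` is the database that holds the date lookup table.
source = spark.conf.get("source")
lookup_db = spark.conf.get("lookup_db")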

In [0]:
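# Static lookup table that maps each calendar date to its week_part value.
@dp.table(
    table_properties={
        # Disallow a full refresh from resetting this table.
        "pipelines.reset.allowed": "false"
    }
)
def date_lookup():
    return spark.read.table(f"{lookup_db}.date_lookup_clone").select("date", "week_part")


# Multiplex bronze table: raw records from every topic, partitioned by topic and week.
@dp.table(
    partition_cols=["topic", "week_part"],
    table_properties={
        "quality": "bronze",
        # Bronze stores the full history of the feed, so disallow resets.
        "pipelines.reset.allowed": "false"
    }
)
def bronze():
    return (
        spark.readStream
            .format("cloudFiles")  # Auto Loader
            .schema("key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG")
            .option("cloudFiles.format", "json")
            .load(f"{source}/daily")
            # `timestamp` is divided by 1000 (epoch milliseconds to seconds) and
            # converted to a date, which is the key used to look up week_part.
            .join(
                F.broadcast(dp.read("date_lookup")),
                F.to_date((F.col("timestamp") / 1000).cast("timestamp")) == F.col("date"),
                "left",
            )
    )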
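
The join condition in `bronze` can be hard to read at a glance, so here is a minimal plain-Python sketch of the same logic: convert epoch milliseconds to a calendar date, then left join against a date lookup. The lookup row, the `week_part` format, and the helper names `epoch_ms_to_date` and `enrich` are assumptions made for illustration. The sketch also assumes UTC, whereas Spark's cast to timestamp uses the session time zone.

```python
from datetime import date, datetime, timezone

# Hypothetical lookup row; the real date_lookup table is not shown in this notebook,
# and the week_part format used here is an assumption.
date_lookup = {date(2021, 1, 1): "2020-53"}


def epoch_ms_to_date(ts_ms):
    """Convert epoch milliseconds to a calendar date (UTC assumed for this sketch)."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date()


def enrich(record):
    """Left join: keep the record even when no lookup row matches.

    As in a SQL left join, the right-side columns (date, week_part)
    are None when there is no match.
    """
    d = epoch_ms_to_date(record["timestamp"])
    matched = d in date_lookup
    return {
        **record,
        "date": d if matched else None,
        "week_part": date_lookup.get(d),
    }


print(enrich({"timestamp": 1609459200000})["week_part"])  # 2020-53
print(enrich({"timestamp": 0})["week_part"])              # None
```

Because the join is a left join, records whose date is missing from the lookup are still written to bronze, only with null `date` and `week_part` values.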

In [0]:
# Distinct topics currently present in the bronze table.
@dp.table
def distinct_topics():
    return dp.read("bronze").select("topic").distinct()

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>