
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Materialized Gold Tables

Because the lakehouse pairs on-demand compute with highly scalable cloud object storage to balance cost and performance, the closest analogue to a materialized view is a gold table. Rather than caching query results behind a view for quick access, the results are persisted to Delta Lake, where they can be read back efficiently.

**NOTE**: Databricks SQL leverages <a href="https://docs.databricks.com/sql/admin/query-caching.html#query-caching" target="_blank">Delta caching and query caching</a>, so repeated executions of the same query can be served from cached results.

Gold tables are highly refined, generally aggregated views of the data that are persisted to Delta Lake.

These tables are intended to drive core business logic, dashboards, and applications.

The necessity of gold tables will evolve over time; as more analysts and data scientists use your Lakehouse, analyzing query history will reveal trends in how data is queried, when, and by whom. Collaborating across teams, data engineers and platform admins can define SLAs to make highly valuable data available to teams in a timely fashion, all while cutting down the potential costs and latency associated with larger ad hoc queries.
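
The difference between a view and a persisted (materialized) table can be sketched outside of Databricks. The example below is only an analogy: it uses Python's built-in `sqlite3` module instead of Delta Lake, and the table names, view names, and heart rate values are hypothetical. It shows that a view re-runs its aggregation on every read, while a persisted table returns stored results until it is explicitly refreshed.

```python
import sqlite3

# In-memory SQLite database standing in for the lakehouse (an analogy only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workout_bpm (user_id INTEGER, workout_id INTEGER, heartrate REAL)"
)
# Hypothetical toy records, not taken from the course dataset.
conn.executemany(
    "INSERT INTO workout_bpm VALUES (?, ?, ?)",
    [(1, 10, 100.0), (1, 10, 120.0), (2, 10, 90.0)],
)

summary_sql = """
    SELECT user_id, workout_id, AVG(heartrate) AS avg_bpm, COUNT(*) AS num_recordings
    FROM workout_bpm
    GROUP BY user_id, workout_id
"""

# A view stores only the query text; the aggregation runs on every read.
conn.execute(f"CREATE VIEW bpm_summary_view AS {summary_sql}")

# A persisted table stores the computed rows; reads scan the stored result.
conn.execute(f"CREATE TABLE bpm_summary_table AS {summary_sql}")

# New data arrives upstream after the table was built.
conn.execute("INSERT INTO workout_bpm VALUES (2, 10, 110.0)")


def count_for_user_2(relation):
    """Return num_recordings for user 2 from the given view or table."""
    row = conn.execute(
        f"SELECT num_recordings FROM {relation} WHERE user_id = 2"
    ).fetchone()
    return row[0]


view_count = count_for_user_2("bpm_summary_view")    # recomputed from the source
table_count = count_for_user_2("bpm_summary_table")  # snapshot from creation time

# Refresh the persisted table on our own schedule.
conn.execute("DELETE FROM bpm_summary_table")
conn.execute(f"INSERT INTO bpm_summary_table {summary_sql}")
refreshed_count = count_for_user_2("bpm_summary_table")
```

The trade-off is the one this lesson relies on: the view is always current but pays the aggregation cost on every query, while the persisted table is cheap to read and only as fresh as its last scheduled update.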

In this notebook, we'll create a gold table that stores summary statistics about each completed workout alongside binned demographic information. In this way, our application can quickly populate statistics about how other users performed on the same workouts.

<img src="https://files.training.databricks.com/images/ade/ADE_arch_bpm_summary.png" width="60%" />

## Learning Objectives
By the end of this lesson, students will be able to:
- Describe performance differences between views and tables
- Implement a streaming aggregate table

## Setup
Set up path and checkpoint variables (these will be used later).

In [0]:
%run ../Includes/Classroom-Setup-5.2

## Explore Workout BPM
Recall that our **`workout_bpm`** table has already matched all completed workouts to user bpm recordings.

Explore this data below.

In [0]:
%sql
SELECT * 
FROM workout_bpm
LIMIT 10

user_id,workout_id,session_id,time,heartrate
36117,5,314,2019-12-02T07:10:51.000+0000,125.26368756909096
36117,5,314,2019-12-02T07:16:20.000+0000,147.90040295844068
36117,5,314,2019-12-02T07:26:51.000+0000,139.2137133879535
36117,5,314,2019-12-02T07:35:44.000+0000,88.29318327972032
38766,30,402,2019-12-02T07:36:41.000+0000,97.20195580865813
12474,45,3,2019-12-02T07:40:08.000+0000,107.0481315632618
14232,32,474,2019-12-02T07:53:32.000+0000,144.145340797465
38766,30,402,2019-12-02T07:55:24.000+0000,108.71960338500053
38766,30,402,2019-12-02T08:00:52.000+0000,108.38694796645322
12474,45,3,2019-12-02T08:01:25.000+0000,63.47976032714755


Here we calculate some summary statistics for our workouts.

In [0]:
%sql
SELECT user_id, workout_id, session_id,
       MIN(heartrate) AS min_bpm,
       MEAN(heartrate) AS avg_bpm,
       MAX(heartrate) AS max_bpm,
       COUNT(heartrate) AS num_recordings
FROM workout_bpm
GROUP BY user_id, workout_id, session_id

user_id,workout_id,session_id,min_bpm,avg_bpm,max_bpm,num_recordings
12474,45,3,57.11846349626094,92.34721283552928,123.77188205079553,230
14508,47,393,58.22173113661009,93.50938540279812,113.49901140031896,193
16093,42,493,71.08167055818177,110.39402171061992,166.2714795154939,320
28588,21,2,81.74440849450012,112.60388589497184,160.06707482451796,128
24863,43,1,55.728371077283605,93.67163937906588,113.11578195451337,141
42387,8,1,60.63179276381176,98.3072480126614,139.26758708521538,449
24863,47,2,55.675317173446096,92.41404290112808,105.55667704065402,166
35728,31,405,54.25373956280828,111.9039894644047,147.23271728707255,282
27306,47,163,68.31222099481197,108.15629046356788,127.72139717731947,179
33987,35,487,70.5887871847356,93.42879200506444,130.70680703681953,346
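
To make the grouping logic concrete, here is a minimal plain-Python sketch of the same aggregation. It assumes only the ten preview rows of **`workout_bpm`** shown earlier (with the `time` column omitted), so its results differ from the full-table output above; the variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# The ten preview rows from workout_bpm shown earlier:
# (user_id, workout_id, session_id, heartrate); the time column is omitted.
rows = [
    (36117, 5, 314, 125.26368756909096),
    (36117, 5, 314, 147.90040295844068),
    (36117, 5, 314, 139.2137133879535),
    (36117, 5, 314, 88.29318327972032),
    (38766, 30, 402, 97.20195580865813),
    (12474, 45, 3, 107.0481315632618),
    (14232, 32, 474, 144.145340797465),
    (38766, 30, 402, 108.71960338500053),
    (38766, 30, 402, 108.38694796645322),
    (12474, 45, 3, 63.47976032714755),
]

# GROUP BY user_id, workout_id, session_id
groups = defaultdict(list)
for user_id, workout_id, session_id, heartrate in rows:
    groups[(user_id, workout_id, session_id)].append(heartrate)

# MIN, MEAN, MAX, and COUNT over each group's heart rate readings
summary = {
    key: {
        "min_bpm": min(values),
        "avg_bpm": mean(values),
        "max_bpm": max(values),
        "num_recordings": len(values),
    }
    for key, values in groups.items()
}
```

Each distinct `(user_id, workout_id, session_id)` combination produces exactly one summary row, which is why the result is much smaller than the raw recordings.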


And now we can use our **`user_bins`** table to match these summaries back to our binned demographic information.

In [0]:
%sql
SELECT workout_id, session_id, a.user_id, age, gender, city, state, min_bpm, avg_bpm, max_bpm, num_recordings
FROM user_bins a
INNER JOIN 
  (SELECT user_id, workout_id, session_id, 
          min(heartrate) AS min_bpm, 
          mean(heartrate) AS avg_bpm,
          max(heartrate) AS max_bpm, 
          count(heartrate) AS num_recordings
   FROM workout_bpm
   GROUP BY user_id, workout_id, session_id) b
ON a.user_id = b.user_id

workout_id,session_id,user_id,age,gender,city,state,min_bpm,avg_bpm,max_bpm,num_recordings
45,3,12474,75-85,M,San Fernando,CA,57.11846349626094,92.34721283552928,123.77188205079553,230
47,393,14508,85-95,M,Sierra Madre,CA,58.22173113661009,93.50938540279812,113.49901140031896,193
42,493,16093,35-45,F,Northridge,CA,71.08167055818177,110.39402171061992,166.2714795154939,320
21,2,28588,35-45,M,Arcadia,CA,81.74440849450012,112.60388589497184,160.06707482451796,128
43,1,24863,85-95,M,Long Beach,CA,55.728371077283605,93.67163937906588,113.11578195451337,141
8,1,42387,55-65,F,Playa Vista,CA,60.63179276381176,98.3072480126614,139.26758708521538,449
47,2,24863,85-95,M,Long Beach,CA,55.675317173446096,92.41404290112808,105.55667704065402,166
31,405,35728,55-65,F,Montebello,CA,54.25373956280828,111.9039894644047,147.23271728707255,282
47,163,27306,55-65,F,Edwards,CA,68.31222099481197,108.15629046356788,127.72139717731947,179
35,487,33987,55-65,F,Canyon Country,CA,70.5887871847356,93.42879200506444,130.70680703681953,346
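
The inner join semantics can be sketched in plain Python as well. This is an illustration under stated assumptions: the two demographic entries are copied from result rows in this notebook, the summaries are reduced to `num_recordings` for brevity, and user `99999` is a hypothetical user added only to show that sessions without a matching row in **`user_bins`** are dropped.

```python
# Demographic bins for two users, copied from result rows in this notebook.
user_bins = {
    12474: {"age": "75-85", "gender": "M", "city": "San Fernando", "state": "CA"},
    14232: {"age": "35-45", "gender": "M", "city": "North Hollywood", "state": "CA"},
}

# Per-session summaries: (user_id, workout_id, session_id, num_recordings).
# User 99999 is hypothetical and has no entry in user_bins.
summaries = [
    (12474, 45, 3, 230),
    (14232, 18, 487, 129),
    (99999, 1, 1, 10),
]

# INNER JOIN ... ON a.user_id = b.user_id: keep only sessions whose user
# has a matching demographic row, and merge the columns from both sides.
joined = [
    {
        "workout_id": workout_id,
        "session_id": session_id,
        "user_id": user_id,
        **user_bins[user_id],
        "num_recordings": num_recordings,
    }
    for (user_id, workout_id, session_id, num_recordings) in summaries
    if user_id in user_bins
]
```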


## Perform an Incremental Batch Table Update
Because our **`workout_bpm`** table was written as an append-only stream, we can update our aggregation using a streaming job as well.

In [0]:
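# Read the append-only Delta table as a stream and expose it as a
# temporary view so the aggregation can be written in SQL.
(spark.readStream
      .table("workout_bpm")
      .createOrReplaceTempView("TEMP_workout_bpm"))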

Using trigger-available-now logic with Delta Lake, each run processes only the records that have arrived in the upstream streaming source since the last checkpoint and then stops; if nothing has changed upstream, no new results are calculated.

In [0]:
user_bins_df = spark.sql("""
    SELECT workout_id, session_id, a.user_id, age, gender, city, state, min_bpm, avg_bpm, max_bpm, num_recordings
    FROM user_bins a
    INNER JOIN
      (SELECT user_id, workout_id, session_id, 
              min(heartrate) AS min_bpm, 
              mean(heartrate) AS avg_bpm, 
              max(heartrate) AS max_bpm, 
              count(heartrate) AS num_recordings
       FROM TEMP_workout_bpm
       GROUP BY user_id, workout_id, session_id) b
    ON a.user_id = b.user_id
    """)

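(user_bins_df
     .writeStream
     .format("delta")
     # The checkpoint tracks which source records have already been processed.
     .option("checkpointLocation", f"{DA.paths.checkpoints}/workout_bpm_summary")
     .option("path", f"{DA.paths.user_db}/workout_bpm_summary.delta")
     # "complete" rewrites the full aggregated result each time the query runs.
     .outputMode("complete")
     # Process all data available now, then stop instead of running continuously.
     .trigger(availableNow=True)
     .table("workout_bpm_summary")
     .awaitTermination())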
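
The interplay of the checkpoint, the available-now trigger, and the complete output mode can be modeled with a toy, pure-Python sketch. This is not the Structured Streaming API: `run_available_now`, the `checkpoint` dict, the `state` dict, and the sample records are all hypothetical stand-ins meant to show the idea that each run reads only unseen records, updates running aggregates, and rewrites the whole result.

```python
def run_available_now(source, checkpoint, state):
    """Process every record not yet seen, update running aggregates, then stop.

    `source` is an append-only list of (key, heartrate) records, `checkpoint`
    remembers how many records were already processed, and `state` holds the
    running aggregates per key. All three are hypothetical stand-ins.
    """
    start = checkpoint.get("offset", 0)
    new_records = source[start:]
    for key, bpm in new_records:
        agg = state.setdefault(key, {"min": bpm, "max": bpm, "sum": 0.0, "count": 0})
        agg["min"] = min(agg["min"], bpm)
        agg["max"] = max(agg["max"], bpm)
        agg["sum"] += bpm
        agg["count"] += 1
    # Record progress so the next run skips everything processed so far.
    checkpoint["offset"] = len(source)
    # "Complete" output: the full result is rebuilt from state on every run.
    result = {
        key: {
            "min_bpm": a["min"],
            "avg_bpm": a["sum"] / a["count"],
            "max_bpm": a["max"],
            "num_recordings": a["count"],
        }
        for key, a in state.items()
    }
    return len(new_records), result


source = [("a", 100.0), ("a", 120.0), ("b", 90.0)]
checkpoint, state = {}, {}

processed_1, result_1 = run_available_now(source, checkpoint, state)  # reads 3 records
processed_2, result_2 = run_available_now(source, checkpoint, state)  # nothing new
source.append(("b", 110.0))
processed_3, result_3 = run_available_now(source, checkpoint, state)  # reads 1 record
```

In this model the second run does no aggregation work because the checkpoint already covers every record, and the third run touches only the newly appended record while still producing a full, up-to-date result.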

## Query Results

The primary benefit of scheduling updates to gold tables, as opposed to defining views, is the ability to control the costs associated with materializing results.

While returning results from this table will use some compute to scan the **`workout_bpm_summary`** table, this design avoids having to scan and join files from multiple tables every time this data is referenced.

In [0]:
%sql
SELECT * FROM workout_bpm_summary

workout_id,session_id,user_id,age,gender,city,state,min_bpm,avg_bpm,max_bpm,num_recordings
3,83,26285,45-55,M,Beverly Hills,CA,37.05482849444279,113.23867744062689,133.54441348291,166
3,416,14508,85-95,M,Sierra Madre,CA,64.15209715278007,93.99203734055368,117.71193587065262,154
40,486,36164,85-95,F,Carson,CA,61.2935461663849,84.06623188970703,97.90920460781857,64
10,93,40872,35-45,M,Los Angeles,CA,56.30002261623926,114.19709480408216,149.01681342604203,256
10,319,34740,25-35,M,Santa Monica,CA,64.0610237063677,118.6055693600998,157.7896044228981,243
13,3,43104,45-55,F,Long Beach,CA,76.27838862371735,107.99528974128064,144.3507458165889,333
35,174,27306,55-65,F,Edwards,CA,68.50143158890829,90.9597633275273,130.43088939163587,358
18,487,14232,35-45,M,North Hollywood,CA,75.477955601599,96.46158296973005,128.1041156422313,129
1,56,37012,95+,M,Torrance,CA,51.21193883454026,76.10179850421733,113.42256961244996,409
27,305,29213,95+,F,North Hollywood,CA,55.764492065796944,87.34860751358602,113.69300268327184,308
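
As a rough sketch of how an application might use these precomputed rows to show how other users performed on the same workout, the example below filters a few of the result rows above by `workout_id`. The function name `peer_range` and its return shape are assumptions made for illustration; only the row values come from the output shown.

```python
# A few rows from the workout_bpm_summary output above:
# (workout_id, session_id, user_id, min_bpm, max_bpm)
summary_rows = [
    (3, 83, 26285, 37.05482849444279, 133.54441348291),
    (3, 416, 14508, 64.15209715278007, 117.71193587065262),
    (10, 93, 40872, 56.30002261623926, 149.01681342604203),
    (10, 319, 34740, 64.0610237063677, 157.7896044228981),
    (13, 3, 43104, 76.27838862371735, 144.3507458165889),
]


def peer_range(rows, workout_id):
    """Return session count and overall bpm range for one workout (hypothetical helper)."""
    matches = [r for r in rows if r[0] == workout_id]
    if not matches:
        return None
    return {
        "num_sessions": len(matches),
        "lowest_bpm": min(r[3] for r in matches),
        "highest_bpm": max(r[4] for r in matches),
    }


workout_10 = peer_range(summary_rows, 10)
```

Because the per-session statistics are already stored, a lookup like this only filters a small summary table rather than scanning and joining the raw recordings.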


Run the following cell to delete the tables and files associated with this lesson.

In [0]:
DA.cleanup()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>