
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Raw to Bronze Pattern

## Notebook Objective

In this notebook we:
1. Ingest Raw Data
2. Augment the data with Ingestion Metadata
3. Batch write the augmented data to a Bronze Table

## Step Configuration

In [0]:
%run ./includes/configuration

### Display the Files in the Raw Path

In [0]:
display(dbutils.fs.ls(rawPath))

## Make Notebook Idempotent

In [0]:
#dbutils.fs.rm(bronzePath, recurse=True)

## Ingest raw data

Next, we read the files from the source directory and load each movie record into a DataFrame that we later write to the Bronze table.

🤠 You should do this as a batch load using `spark.read`

Read in using the format `"json"` with the `multiline` option and the provided schema.
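
To see why the `multiline` option in the read cell matters, here is a plain-Python sketch (not Spark), assuming the source files are pretty-printed JSON documents that span several lines. The sample document and the helper `parse_line_by_line` are hypothetical; the helper only mimics a reader that expects one complete JSON value per line.

```python
import json

# Hypothetical pretty-printed document shaped like the files under movie_path.
pretty_document = json.dumps(
    {"movie": [{"Id": 1, "Title": "Inception"}]}, indent=2
)


def parse_line_by_line(text):
    """Mimic a line-delimited reader: every line must be a full JSON value."""
    parsed, failed = [], 0
    for line in text.splitlines():
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            failed += 1
    return parsed, failed


# Line-oriented parsing cannot make sense of a document spread over many lines.
records, failures = parse_line_by_line(pretty_document)

# Parsing the whole document at once recovers the nested structure.
whole = json.loads(pretty_document)

print(len(records), failures > 0, whole["movie"][0]["Title"])
```

A line-oriented parse recovers nothing from such a file, while parsing the document as a whole yields the `movie` array.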

In [0]:
from pyspark.sql.types import ArrayType, StringType, StructField, StructType
from pyspark.sql.functions import explode

movie_path = "/FileStore/movie/*.json"

movie_schema = StructType([
  StructField("movie", ArrayType(StructType([
      StructField("BackdropUrl",StringType(),True),
      StructField("Budget",StringType(),True),
      StructField("CreatedBy",StringType(),True),
      StructField("CreatedDate",StringType(),True),
      StructField("Id",StringType(),True),
      StructField("ImdbUrl",StringType(),True),
      StructField("OriginalLanguage",StringType(),True),
      StructField("Overview",StringType(),True),
      StructField("PosterUrl",StringType(),True),
      StructField("Price",StringType(),True),
      StructField("ReleaseDate",StringType(),True),
      StructField("Revenue",StringType(),True),
      StructField("RunTime",StringType(),True),
      StructField("Tagline",StringType(),True),
      StructField("Title",StringType(),True),
      StructField("TmdbUrl",StringType(),True),
      StructField("UpdatedBy",StringType(),True),
      StructField("UpdatedDate",StringType(),True),
      StructField("genres",ArrayType(
        StructType([
        StructField("id", StringType(),True),
        StructField("name", StringType(), True)]),True), True)]),True), True)
  ])

movie_data_df = (
    spark.read.format("json")
    .option("multiline", "true")
    .schema(movie_schema)
    .load(movie_path)
)
# One row per movie; explode() names its output column "col" by default.
movie_data_df = movie_data_df.select(explode(movie_data_df.movie))


## Display the Raw Data

🤓 Each row here is one movie record, held as a struct in the column `col` that `explode` produces.
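
As a rough illustration of what `explode` does to the `movie` array, the following plain-Python sketch (not Spark) turns one nested document into one row per movie. The sample document and `explode_movies` are hypothetical; only the `movie` key and the `col` column name come from the notebook.

```python
import json

# Hypothetical two-movie document; real files carry many more fields.
raw_document = json.dumps(
    {
        "movie": [
            {"Id": 1, "Title": "Inception"},
            {"Id": 2, "Title": "Interstellar"},
        ]
    }
)


def explode_movies(document):
    """Return one {"col": movie} row per element of the top-level "movie" array."""
    parsed = json.loads(document)
    return [{"col": movie} for movie in parsed.get("movie", [])]


rows = explode_movies(raw_document)
for row in rows:
    print(row["col"]["Id"], row["col"]["Title"])
```

One input document with an array of two movies becomes two rows, each holding a single movie under `col`.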

In [0]:
display(movie_data_df.limit(10))

col
"List(https://image.tmdb.org/t/p/original//s3TBrRGB1iav7gFOCNx3H31MoES.jpg, 1.6E8, null, 2021-04-03T16:51:30.1633333, 1, https://www.imdb.com/title/tt1375666, en, Cobb, a skilled thief who commits corporate espionage by infiltrating the subconscious of his targets is offered a chance to regain his old life as payment for a task considered to be impossible: ""inception"", the implantation of another person's idea into a target's subconscious., https://image.tmdb.org/t/p/w342//9gk7adHYeDvHkCSEqAvQNLV5Uge.jpg, 9.9, 2010-07-15T00:00:00, 8.25532764E8, 148, Your mind is the scene of the crime., Inception, https://www.themoviedb.org/movie/27205, null, null, null)"
"List(https://image.tmdb.org/t/p/original//xJHokMbljvjADYdit5fK5VQsXEG.jpg, 1.65E8, null, 2021-04-03T16:51:30.1633333, 2, https://www.imdb.com/title/tt0816692, en, The adventures of a group of explorers who make use of a newly discovered wormhole to surpass the limitations on human space travel and conquer the vast distances involved in an interstellar voyage., https://image.tmdb.org/t/p/w342//gEU2QniE6E77NI6lCU6MxlNBvIx.jpg, 9.9, 2014-11-05T00:00:00, 6.75120017E8, 169, Mankind was born on Earth. It was never meant to die here., Interstellar, https://www.themoviedb.org/movie/157336, null, null, null)"
"List(https://image.tmdb.org/t/p/original//hkBaDkMWbLaf8B1lsWsKX7Ew3Xq.jpg, 1.85E8, null, 2021-04-03T16:51:30.1633333, 3, https://www.imdb.com/title/tt0468569, en, Batman raises the stakes in his war on crime. With the help of Lt. Jim Gordon and District Attorney Harvey Dent, Batman sets out to dismantle the remaining criminal organizations that plague the streets. The partnership proves to be effective, but they soon find themselves prey to a reign of chaos unleashed by a rising criminal mastermind known to the terrified citizens of Gotham as the Joker., https://image.tmdb.org/t/p/w342//qJ2tW6WMUDux911r6m7haRef0WH.jpg, 9.9, 2008-07-16T00:00:00, 1.004558444E9, 152, Why So Serious?, The Dark Knight, https://www.themoviedb.org/movie/155, null, null, null)"
"List(https://image.tmdb.org/t/p/original//en971MEXui9diirXlogOrPKmsEn.jpg, 5.8E7, null, 2021-04-03T16:51:30.1633333, 4, https://www.imdb.com/title/tt1431045, en, Deadpool tells the origin story of former Special Forces operative turned mercenary Wade Wilson, who after being subjected to a rogue experiment that leaves him with accelerated healing powers, adopts the alter ego Deadpool. Armed with his new abilities and a dark, twisted sense of humor, Deadpool hunts down the man who nearly destroyed his life., https://image.tmdb.org/t/p/w342//yGSxMiF0cYuAiyuve5DA6bnWEOI.jpg, 9.9, 2016-02-09T00:00:00, 7.831E8, 108, Witness the beginning of a happy ending, Deadpool, https://www.themoviedb.org/movie/293660, null, null, null)"
"List(https://image.tmdb.org/t/p/original//kwUQFeFXOOpgloMgZaadhzkbTI4.jpg, 2.2E8, null, 2021-04-03T16:51:30.1666667, 5, https://www.imdb.com/title/tt0848228, en, When an unexpected enemy emerges and threatens global safety and security, Nick Fury, director of the international peacekeeping agency known as S.H.I.E.L.D., finds himself in need of a team to pull the world back from the brink of disaster. Spanning the globe, a daring recruitment effort begins!, https://image.tmdb.org/t/p/w342//RYMX2wcKCBAr24UyPD7xwmjaTn.jpg, 9.9, 2012-04-25T00:00:00, 1.51955791E9, 143, Some assembly required., The Avengers, https://www.themoviedb.org/movie/24428, null, null, null)"
"List(https://image.tmdb.org/t/p/original//AmHOQ7rpHwiaUMRjKXztnauSJb7.jpg, 2.37E8, null, 2021-04-03T16:51:30.1666667, 6, https://www.imdb.com/title/tt0499549, en, In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization., https://image.tmdb.org/t/p/w342//6EiRUJpuoeQPghrs3YNktfnqOVh.jpg, 9.9, 2009-12-10T00:00:00, 2.787965087E9, 162, Enter the World of Pandora., Avatar, https://www.themoviedb.org/movie/19995, null, null, null)"
"List(https://image.tmdb.org/t/p/original//mZSAu5acXueGC4Z3S5iLSWx8AEp.jpg, 1.7E8, null, 2021-04-03T16:51:30.1666667, 7, https://www.imdb.com/title/tt2015381, en, Light years from Earth, 26 years after being abducted, Peter Quill finds himself the prime target of a manhunt after discovering an orb wanted by Ronan the Accuser., https://image.tmdb.org/t/p/w342//r7vmZjiyZw9rpJMQJdXpjgiCOk9.jpg, 9.9, 2014-07-30T00:00:00, 7.727766E8, 121, All heroes start somewhere., Guardians of the Galaxy, https://www.themoviedb.org/movie/118340, null, null, null)"
"List(https://image.tmdb.org/t/p/original//52AfXWuXCHn3UjD17rBruA9f5qb.jpg, 6.3E7, null, 2021-04-03T16:51:30.1666667, 8, https://www.imdb.com/title/tt0137523, en, A ticking-time-bomb insomniac and a slippery soap salesman channel primal male aggression into a shocking new form of therapy. Their concept catches on, with underground ""fight clubs"" forming in every town, until an eccentric gets in the way and ignites an out-of-control spiral toward oblivion., https://image.tmdb.org/t/p/w342//8kNruSfhk5IoE4eZOc4UpvDn6tq.jpg, 9.9, 1999-10-15T00:00:00, 1.00853753E8, 139, Mischief. Mayhem. Soap., Fight Club, https://www.themoviedb.org/movie/550, null, null, null)"
"List(https://image.tmdb.org/t/p/original//lmZFxXgJE3vgrciwuDib0N8CfQo.jpg, 3.0E8, null, 2021-04-03T16:51:30.1666667, 9, https://www.imdb.com/title/tt4154756, en, As the Avengers and their allies have continued to protect the world from threats too large for any one hero to handle, a new danger has emerged from the cosmic shadows: Thanos. A despot of intergalactic infamy, his goal is to collect all six Infinity Stones, artifacts of unimaginable power, and use them to inflict his twisted will on all of reality. Everything the Avengers have fought for has led up to this moment - the fate of Earth and existence itself has never been more uncertain., https://image.tmdb.org/t/p/w342//7WsyChQLEftFiDOVTGkv3hFpyyt.jpg, 9.9, 2018-04-25T00:00:00, 2.046239637E9, 149, An entire universe. Once and for all., Avengers: Infinity War, https://www.themoviedb.org/movie/299536, null, null, null)"
"List(https://image.tmdb.org/t/p/original//w7RDIgQM6bLT7JXtH4iUQd3Iwxm.jpg, 8000000.0, null, 2021-04-03T16:51:30.1666667, 10, https://www.imdb.com/title/tt0110912, en, A burger-loving hit man, his philosophical partner, a drug-addled gangster's moll and a washed-up boxer converge in this sprawling, comedic crime caper. Their adventures unfurl in three stories that ingeniously trip back and forth in time., https://image.tmdb.org/t/p/w342//plnlrtBUULT0rh3Xsjmpubiso3L.jpg, 9.9, 1994-09-10T00:00:00, 2.14179088E8, 154, Just because you are a character doesn't mean you have character., Pulp Fiction, https://www.themoviedb.org/movie/680, null, null, null)"


## Ingestion Metadata

As part of the ingestion process, we record metadata for the ingestion.

**EXERCISE:** Add metadata to the incoming raw data. You should add the following columns:

- data source (`datasource`), use `"antra"`
- ingestion time (`ingesttime`)
- status (`status`), use `"new"`
- ingestion date (`ingestdate`)
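
The four columns above can be pictured with a plain-Python sketch (not Spark) that wraps a single record. The function `add_ingestion_metadata`, the sample record, and the fixed timestamp are hypothetical; the value `"antra"` mirrors the cell below.

```python
from datetime import datetime, timezone


def add_ingestion_metadata(record, datasource, ingest_time=None):
    """Wrap one raw record with the four metadata fields listed above."""
    # Take a single timestamp so ingesttime and ingestdate always agree.
    if ingest_time is None:
        ingest_time = datetime.now(timezone.utc)
    return {
        "col": record,
        "datasource": datasource,
        "ingesttime": ingest_time,
        "status": "new",
        "ingestdate": ingest_time.date(),
    }


# A fixed timestamp keeps the example deterministic.
fixed_time = datetime(2022, 8, 23, 6, 49, 6, tzinfo=timezone.utc)
row = add_ingestion_metadata({"Id": 1, "Title": "Inception"}, "antra", fixed_time)
print(row["status"], row["ingestdate"])
```

Deriving the ingestion date from the same timestamp as the ingestion time avoids the two drifting apart around midnight.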

In [0]:
# Wrap each movie struct with ingestion metadata.
from pyspark.sql.functions import current_timestamp, lit

raw_movie_data_df = (
  movie_data_df.select(
    "col",
    lit("antra").alias("datasource"),
    current_timestamp().alias("ingesttime"),
    lit("new").alias("status"),
    current_timestamp().cast("date").alias("ingestdate"),
  )
)

## WRITE Batch to a Bronze Table

Finally, we write to the Bronze Table.

Make sure to write the columns in the correct order (`"datasource"`, `"ingesttime"`, `"col"`, `"status"`, `"p_ingestdate"`).

Make sure to use the following options:

- the format `"delta"`
- using the append mode
- partition by `p_ingestdate`
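
Partitioning by `p_ingestdate` places the rows for each date under their own `p_ingestdate=<date>/` directory, as the file listing further down shows. The plain-Python sketch below (not Spark or Delta) only imitates that directory naming. The helpers `partition_dir` and `group_by_partition`, the sample rows, and the second date are hypothetical; the base path is the one used in the write cell.

```python
from collections import defaultdict
from datetime import date


def partition_dir(base_path, column, value):
    """Build the Hive-style directory name used for one partition value."""
    return f"{base_path.rstrip('/')}/{column}={value.isoformat()}/"


def group_by_partition(rows, base_path, column):
    """Group row dicts by the directory their partition value maps to."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_dir(base_path, column, row[column])].append(row)
    return dict(groups)


rows = [
    {"col": {"Id": 1}, "p_ingestdate": date(2022, 8, 23)},
    {"col": {"Id": 2}, "p_ingestdate": date(2022, 8, 23)},
    {"col": {"Id": 3}, "p_ingestdate": date(2022, 8, 24)},
]
layout = group_by_partition(rows, "/FileStore/movieBronze", "p_ingestdate")
for directory, members in sorted(layout.items()):
    print(directory, len(members))
```

Because every run appends rows stamped with the current date, each day's load lands in its own directory and can be read or removed without touching other days.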

In [0]:
# Reorder the columns and append to the Delta table, partitioned by ingest date.
from pyspark.sql.functions import col
bronzePath = "/FileStore/movieBronze"
(
    raw_movie_data_df.select(
        "datasource",
        "ingesttime",
        "col",
        "status",
        col("ingestdate").alias("p_ingestdate"),
    )
    .write.format("delta")
    .mode("append")
    .partitionBy("p_ingestdate")
    .save(bronzePath)
)

In [0]:
display(dbutils.fs.ls(bronzePath))

path,name,size,modificationTime
dbfs:/FileStore/movieBronze/_delta_log/,_delta_log/,0,1661238544000
dbfs:/FileStore/movieBronze/p_ingestdate=2022-08-23/,p_ingestdate=2022-08-23/,0,1661238541000


## Register the Bronze Table in the Metastore

The table should be named `raw_movie_data_classic_bronze`.

In [0]:
# Drop any stale registration, then point the table at the Delta files.
spark.sql(
    """
DROP TABLE IF EXISTS raw_movie_data_classic_bronze
"""
)

spark.sql(
    f"""
CREATE TABLE raw_movie_data_classic_bronze
USING DELTA
LOCATION "{bronzePath}"
"""
)

## Display Classic Bronze Table

Run this query to display the contents of the Classic Bronze Table.

In [0]:
%sql

SELECT *
FROM raw_movie_data_classic_bronze
LIMIT 10;

datasource,ingesttime,col,status,p_ingestdate
antra,2022-08-23T06:49:06.048+0000,"List(https://image.tmdb.org/t/p/original//s3TBrRGB1iav7gFOCNx3H31MoES.jpg, 1.6E8, null, 2021-04-03T16:51:30.1633333, 1, https://www.imdb.com/title/tt1375666, en, Cobb, a skilled thief who commits corporate espionage by infiltrating the subconscious of his targets is offered a chance to regain his old life as payment for a task considered to be impossible: ""inception"", the implantation of another person's idea into a target's subconscious., https://image.tmdb.org/t/p/w342//9gk7adHYeDvHkCSEqAvQNLV5Uge.jpg, 9.9, 2010-07-15T00:00:00, 8.25532764E8, 148, Your mind is the scene of the crime., Inception, https://www.themoviedb.org/movie/27205, null, null, null)",new,2022-08-23
antra,2022-08-23T06:49:06.048+0000,"List(https://image.tmdb.org/t/p/original//xJHokMbljvjADYdit5fK5VQsXEG.jpg, 1.65E8, null, 2021-04-03T16:51:30.1633333, 2, https://www.imdb.com/title/tt0816692, en, The adventures of a group of explorers who make use of a newly discovered wormhole to surpass the limitations on human space travel and conquer the vast distances involved in an interstellar voyage., https://image.tmdb.org/t/p/w342//gEU2QniE6E77NI6lCU6MxlNBvIx.jpg, 9.9, 2014-11-05T00:00:00, 6.75120017E8, 169, Mankind was born on Earth. It was never meant to die here., Interstellar, https://www.themoviedb.org/movie/157336, null, null, null)",new,2022-08-23
antra,2022-08-23T06:49:06.048+0000,"List(https://image.tmdb.org/t/p/original//hkBaDkMWbLaf8B1lsWsKX7Ew3Xq.jpg, 1.85E8, null, 2021-04-03T16:51:30.1633333, 3, https://www.imdb.com/title/tt0468569, en, Batman raises the stakes in his war on crime. With the help of Lt. Jim Gordon and District Attorney Harvey Dent, Batman sets out to dismantle the remaining criminal organizations that plague the streets. The partnership proves to be effective, but they soon find themselves prey to a reign of chaos unleashed by a rising criminal mastermind known to the terrified citizens of Gotham as the Joker., https://image.tmdb.org/t/p/w342//qJ2tW6WMUDux911r6m7haRef0WH.jpg, 9.9, 2008-07-16T00:00:00, 1.004558444E9, 152, Why So Serious?, The Dark Knight, https://www.themoviedb.org/movie/155, null, null, null)",new,2022-08-23
antra,2022-08-23T06:49:06.048+0000,"List(https://image.tmdb.org/t/p/original//en971MEXui9diirXlogOrPKmsEn.jpg, 5.8E7, null, 2021-04-03T16:51:30.1633333, 4, https://www.imdb.com/title/tt1431045, en, Deadpool tells the origin story of former Special Forces operative turned mercenary Wade Wilson, who after being subjected to a rogue experiment that leaves him with accelerated healing powers, adopts the alter ego Deadpool. Armed with his new abilities and a dark, twisted sense of humor, Deadpool hunts down the man who nearly destroyed his life., https://image.tmdb.org/t/p/w342//yGSxMiF0cYuAiyuve5DA6bnWEOI.jpg, 9.9, 2016-02-09T00:00:00, 7.831E8, 108, Witness the beginning of a happy ending, Deadpool, https://www.themoviedb.org/movie/293660, null, null, null)",new,2022-08-23
antra,2022-08-23T06:49:06.048+0000,"List(https://image.tmdb.org/t/p/original//kwUQFeFXOOpgloMgZaadhzkbTI4.jpg, 2.2E8, null, 2021-04-03T16:51:30.1666667, 5, https://www.imdb.com/title/tt0848228, en, When an unexpected enemy emerges and threatens global safety and security, Nick Fury, director of the international peacekeeping agency known as S.H.I.E.L.D., finds himself in need of a team to pull the world back from the brink of disaster. Spanning the globe, a daring recruitment effort begins!, https://image.tmdb.org/t/p/w342//RYMX2wcKCBAr24UyPD7xwmjaTn.jpg, 9.9, 2012-04-25T00:00:00, 1.51955791E9, 143, Some assembly required., The Avengers, https://www.themoviedb.org/movie/24428, null, null, null)",new,2022-08-23
antra,2022-08-23T06:49:06.048+0000,"List(https://image.tmdb.org/t/p/original//AmHOQ7rpHwiaUMRjKXztnauSJb7.jpg, 2.37E8, null, 2021-04-03T16:51:30.1666667, 6, https://www.imdb.com/title/tt0499549, en, In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization., https://image.tmdb.org/t/p/w342//6EiRUJpuoeQPghrs3YNktfnqOVh.jpg, 9.9, 2009-12-10T00:00:00, 2.787965087E9, 162, Enter the World of Pandora., Avatar, https://www.themoviedb.org/movie/19995, null, null, null)",new,2022-08-23
antra,2022-08-23T06:49:06.048+0000,"List(https://image.tmdb.org/t/p/original//mZSAu5acXueGC4Z3S5iLSWx8AEp.jpg, 1.7E8, null, 2021-04-03T16:51:30.1666667, 7, https://www.imdb.com/title/tt2015381, en, Light years from Earth, 26 years after being abducted, Peter Quill finds himself the prime target of a manhunt after discovering an orb wanted by Ronan the Accuser., https://image.tmdb.org/t/p/w342//r7vmZjiyZw9rpJMQJdXpjgiCOk9.jpg, 9.9, 2014-07-30T00:00:00, 7.727766E8, 121, All heroes start somewhere., Guardians of the Galaxy, https://www.themoviedb.org/movie/118340, null, null, null)",new,2022-08-23
antra,2022-08-23T06:49:06.048+0000,"List(https://image.tmdb.org/t/p/original//52AfXWuXCHn3UjD17rBruA9f5qb.jpg, 6.3E7, null, 2021-04-03T16:51:30.1666667, 8, https://www.imdb.com/title/tt0137523, en, A ticking-time-bomb insomniac and a slippery soap salesman channel primal male aggression into a shocking new form of therapy. Their concept catches on, with underground ""fight clubs"" forming in every town, until an eccentric gets in the way and ignites an out-of-control spiral toward oblivion., https://image.tmdb.org/t/p/w342//8kNruSfhk5IoE4eZOc4UpvDn6tq.jpg, 9.9, 1999-10-15T00:00:00, 1.00853753E8, 139, Mischief. Mayhem. Soap., Fight Club, https://www.themoviedb.org/movie/550, null, null, null)",new,2022-08-23
antra,2022-08-23T06:49:06.048+0000,"List(https://image.tmdb.org/t/p/original//lmZFxXgJE3vgrciwuDib0N8CfQo.jpg, 3.0E8, null, 2021-04-03T16:51:30.1666667, 9, https://www.imdb.com/title/tt4154756, en, As the Avengers and their allies have continued to protect the world from threats too large for any one hero to handle, a new danger has emerged from the cosmic shadows: Thanos. A despot of intergalactic infamy, his goal is to collect all six Infinity Stones, artifacts of unimaginable power, and use them to inflict his twisted will on all of reality. Everything the Avengers have fought for has led up to this moment - the fate of Earth and existence itself has never been more uncertain., https://image.tmdb.org/t/p/w342//7WsyChQLEftFiDOVTGkv3hFpyyt.jpg, 9.9, 2018-04-25T00:00:00, 2.046239637E9, 149, An entire universe. Once and for all., Avengers: Infinity War, https://www.themoviedb.org/movie/299536, null, null, null)",new,2022-08-23
antra,2022-08-23T06:49:06.048+0000,"List(https://image.tmdb.org/t/p/original//w7RDIgQM6bLT7JXtH4iUQd3Iwxm.jpg, 8000000.0, null, 2021-04-03T16:51:30.1666667, 10, https://www.imdb.com/title/tt0110912, en, A burger-loving hit man, his philosophical partner, a drug-addled gangster's moll and a washed-up boxer converge in this sprawling, comedic crime caper. Their adventures unfurl in three stories that ingeniously trip back and forth in time., https://image.tmdb.org/t/p/w342//plnlrtBUULT0rh3Xsjmpubiso3L.jpg, 9.9, 1994-09-10T00:00:00, 2.14179088E8, 154, Just because you are a character doesn't mean you have character., Pulp Fiction, https://www.themoviedb.org/movie/680, null, null, null)",new,2022-08-23


### Query Broken Records


Run a SQL query to display just the incoming records for "Gonzalo Valdés".

🧠 You can use the SQL operator `RLIKE`, which is short for regex `LIKE`,
to create your matching predicate.

[`RLIKE` documentation](https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#rlike)
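
As a rough analogy for how an `RLIKE` predicate filters rows, the plain-Python sketch below uses `re.search`, which likewise succeeds when the pattern matches anywhere in the string. The helper `rlike` and the sample records are hypothetical, and Python's regex dialect differs from Spark's in some details, so treat this as an illustration of the matching idea only.

```python
import re


def rlike(value, pattern):
    """Approximate SQL `value RLIKE pattern`: true if the regex matches anywhere."""
    if value is None:
        # SQL yields NULL here, which a WHERE clause filters out as well.
        return False
    return re.search(pattern, value) is not None


# Hypothetical raw records stored as JSON strings.
records = [
    '{"name": "Gonzalo Valdés", "heartrate": 52.1}',
    '{"name": "Deborah Powell", "heartrate": 60.3}',
    None,
]
matches = [r for r in records if rlike(r, "Gonzalo Valdés")]
print(len(matches))
```

Unlike `LIKE`, no wildcards are needed around the name, because the pattern is not anchored to the start or end of the value.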

In [0]:
%sql

SELECT * FROM health_tracker_classic_bronze WHERE value RLIKE 'Gonzalo Valdés'

### What do you notice?

### Display the User Dimension Table


Run a SQL query to display the records in `health_tracker_user`.

In [0]:
%sql

SELECT * FROM health_tracker_user

## Purge Raw File Path

We have loaded the raw files using batch loading, whereas with the Plus pipeline we used Streaming.

Because a batch read does not use checkpointing, it does not track which files have already been ingested.

We therefore need to manually purge the raw files that have been loaded.
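
The effect of purging can be sketched in plain Python (no Spark or `dbutils`) with a temporary directory standing in for the raw path. The helper `ingest_and_purge` and the file names are hypothetical; the point is only that a second batch run finds nothing left to load once the files are removed.

```python
import tempfile
from pathlib import Path


def ingest_and_purge(raw_dir):
    """Read every *.json file once, then delete it so a rerun does not reload it."""
    loaded = []
    for path in sorted(raw_dir.glob("*.json")):
        loaded.append((path.name, path.read_text()))
        path.unlink()  # purge only after the contents have been read
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp)
    (raw_dir / "movie_0.json").write_text('{"movie": []}')
    (raw_dir / "movie_1.json").write_text('{"movie": []}')

    first_run = ingest_and_purge(raw_dir)   # loads both files
    second_run = ingest_and_purge(raw_dir)  # nothing left to load

print(len(first_run), len(second_run))
```

Without the purge step, the second run would read the same two files again and, in append mode, duplicate their rows.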

In [0]:
#dbutils.fs.rm(rawPath, recurse=True)


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>